EPGD: a comprehensive web resource for integrating and displaying eukaryotic paralog/paralogon information.

Guohui Ding, Yan Sun, Hong Li, Zhen Wang, Haiwei Fan, Chuan Wang, Dan Yang, Yixue Li.

Abstract

Gene duplication is common in all three domains of life, especially in eukaryotic genomes. The duplicates provide new material for the action of evolutionary forces such as selection or genetic drift. Here we describe a sophisticated procedure to extract duplicated genes (paralogs) from 26 available eukaryotic genomes, to pre-calculate several evolutionary indexes (evolutionary rate, synonymous distance/clock, transition redundant exchange clock, etc.) based on the paralog family, and to identify block or segmental duplications (paralogons). We also constructed an internet-accessible Eukaryotic Paralog Group Database (EPGD; http://epgd.biosino.org/EPGD/). The database is gene-centered and organized by paralog family. It focuses on paralogs and evolutionary duplication events. The paralog families and paralogons can be searched by text or sequence, and are downloadable from the website as plain text files. The database will be very useful for both experimentalists and bioinformaticians interested in the study of duplication events or paralog families.

Year: 2007 PMID: 17984073 PMCID: PMC2238967 DOI: 10.1093/nar/gkm924

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The occurrences and consequences of gene and genome duplication events have been discussed for a long time (1,2). The duplication of genes and large genome regions (or entire genomes) is proposed to be an important mechanism for the evolution of phenotypic complexity, diversity and innovation, and an origin of novel gene functions. To uncover the evolutionary trajectories of duplicated genes, previous studies have integrated transcriptomic, interactomic and other data (1). Such integrated approaches, focusing on gene duplications in genomes, have already contributed robust insights into important evolutionary questions, such as the complexity of genes (3), the evolution of genome architecture (4), the growth of gene networks (5), the 2R hypothesis (6) and the diversity of gene expression (7). Moreover, duplicated genes can be used to investigate diverging gene functions, which, when allied with computational methods, may provide useful information for experimental approaches. An example is the analysis of the molecular basis of the adaptive evolution of the duplicated pancreatic ribonuclease gene in leaf-eating monkeys with both computational and experimental approaches (8). As more genomes are examined, increasing evidence supports the dominant role of gene duplication events in the expansion of genome content (2,9).

A crucial step in the study of gene duplications is to identify duplicated genes (known as paralogs) in genome sequences and to distinguish these from genes that have similar sequences but arose from convergent evolution or other mechanisms. Algorithm-based homology detection from primary sequences is the preferred approach to detect paralogs or paralogous regions (4). In contrast to ortholog databases, only a few dedicated paralog databases are available in the public domain. Even though several general homolog databases, such as Inparanoid (10), Ensembl Compara (11) and NCBI HomoloGene (12), include some paralog information, they do not comprehensively summarize and display the evolutionary information of paralogs.

In order to construct a stable web resource that supports easy browsing and downloading of evolutionary information on paralogous genes, we created EPGD (Eukaryotic Paralog Group Database; http://epgd.biosino.org/EPGD/). Several of the steps used to identify the paralogs contained in EPGD were used previously to detect duplication events in the family of animal transmembrane genes (13). Using this work (13) as a basis, we developed a semi-automatic procedure for collecting the within-species paralog families from genomes and pre-calculating several evolutionary indexes of these families. We collected paralogs only from eukaryotes, as they are known to have a higher rate of gene duplication than prokaryotes (14) and are more widely studied in this field.

A pioneer in the construction of paralog databases is ParaDB (15). A highlight of ParaDB is the display of paralogons, which have been thoroughly investigated in the human genome (16) and are reviewed by Van de Peer (4). EPGD inherits this feature and adopts the term ‘paralogon’, defined as homologous genomic segments created by partial or complete genome duplication. EPGD focuses on families of paralogs and integrates spatial and temporal data to diagnose gene duplication processes comprehensively (17). The ratio of dN (the rate of non-synonymous substitutions) to dS (the rate of synonymous substitutions) (18), the synonymous distance/clock, the transition redundant exchange (TREx) clock (19), paralogons and several other features were generated by computational methods and deposited in the database. In the current EPGD version, 26 eukaryotic genomes were processed and 35 991 paralog families and 29 480 paralogons were identified and stored (Table 1). To our knowledge, it is one of the most extensive paralog databases in the public domain. All data can be browsed, searched and downloaded directly from the website.

Table 1.

Summary of the content in EPGD

Species	TaxID	Paralogs	Genes	Paralogons	Ratio^a	Families	Family size^b
Plasmodium falciparum	36329	494	5365	433	0.09	90	5.4889
Kluyveromyces lactis	284590	539	5504	50	0.1	206	2.6165
Cryptococcus neoformans	214684	736	6617	94	0.11	252	2.9206
Apis mellifera	7460	1223	9430	58	0.13	371	3.2965
Debaryomyces hansenii	284592	992	7081	109	0.14	334	2.9701
Candida glabrata	284593	756	5534	72	0.14	304	2.4868
Yarrowia lipolytica	284591	1056	7180	317	0.15	294	3.5918
Schizosaccharomyces pombe	284812	815	5374	119	0.15	302	2.6987
Encephalitozoon cuniculi	284813	312	2029	161	0.15	87	3.5862
Aspergillus fumigatus	330879	1573	10 157	470	0.15	504	3.121
Anopheles gambiae	180454	2169	13 748	521	0.16	565	3.8389
Bos taurus	9913	4995	28 806	541	0.17	1232	4.0544
Danio rerio	7955	6765	38 631	1014	0.18	1915	3.5326
Saccharomyces cerevisiae	4932	1269	6198	484	0.2	473	2.6829
Drosophila melanogaster	7227	3130	14 838	568	0.21	773	4.0492
Macaca mulatta	9544	6579	29 122	1189	0.23	1826	3.603
Pan troglodytes	9598	7147	31 482	1913	0.23	1944	3.6764
Tribolium castaneum	7070	2335	9837	344	0.24	549	4.2532
Gallus gallus	9031	5017	19 828	883	0.25	1500	3.3447
Canis familiaris	9615	6065	20 053	1443	0.3	1671	3.6296
Caenorhabditis elegans	6239	6528	21 052	1139	0.31	1331	4.9046
Homo sapiens	9606	10 962	33 610	2134	0.33	3445	3.182
Mus musculus	10090	14 592	41 323	2705	0.35	3390	4.3044
Rattus norvegicus	10116	12 959	35 786	2234	0.36	3387	3.8261
Arabidopsis thaliana	3702	15 573	32 025	9581	0.49	3590	4.3379
Strongylocentrotus purpuratus	7668	15 773	30 552	904	0.52	5656	2.7887

^a Ratio of the duplicated genes to all genes.

^b Average family size in genes.
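The two derived columns of Table 1 can be recomputed from the raw counts. The sketch below assumes that the ratio is the paralog count divided by the gene count (as footnote a states) and that the average family size is the paralog count divided by the number of families; the second relation is an inference from the table values, not a statement by the authors. The function name `derived_columns` is hypothetical.

```python
# Recompute the derived columns of Table 1 from the raw counts.
# Each row is (species, paralogs, genes, families), copied from Table 1.
ROWS = [
    ("Plasmodium falciparum", 494, 5365, 90),
    ("Saccharomyces cerevisiae", 1269, 6198, 473),
    ("Homo sapiens", 10962, 33610, 3445),
]


def derived_columns(paralogs, genes, families):
    """Return (ratio of duplicated genes, average family size in genes).

    Assumption: family size = paralogs / families (inferred from the table).
    """
    return paralogs / genes, paralogs / families


for species, paralogs, genes, families in ROWS:
    ratio, size = derived_columns(paralogs, genes, families)
    print(f"{species}: ratio={ratio:.2f}, family size={size:.4f}")
```

Comparing the printed values with the Ratio and Family size columns is a quick consistency check when the table is re-keyed or parsed from the downloadable text files.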

CONSTRUCTION AND CONTENT

EPGD is implemented with a MySQL relational database (http://www.mysql.com) and JavaServer Pages technology (http://java.sun.com/products/jsp/). The raw datasets of 26 eukaryotic genomes (Table 1) in GenBank flat file format (GBK) were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes) in March 2007. Proteins, coding sequences (CDS) and gene location information were extracted from these GBK files with a Perl script.

Overview of the procedure

A total of 531 715 coding sequences and corresponding proteins were obtained after preprocessing. Only the protein sequences were used to construct the paralog families. The procedure is briefly described below:

1. Pairwise alignment of the proteins using gapped BLAST (20), with filtering of low sequence complexity regions using SEG (21). The default parameters were used, except for the E-value threshold of 10^−5.
2. Definition of the homologous genes. Four criteria must be satisfied: (a) all high-scoring segment pairs (HSPs) in the target sequence have to be arranged in the same order as in the query protein sequence (22); (b) the remaining HSPs cover more than 80% of the protein length; (c) the similarity of each HSP is more than 50% (two amino acids are considered similar if their BLOSUM62 similarity score is positive) (22); and (d) these conditions are symmetrical for both genes.
3. Single linkage clustering of homologous genes (13), generating the primary paralog families.
4. Mapping of the proteins to gene loci. Paralog families with at least two gene loci were retained.
5. Multiple alignment of the proteins in each retained family, using ClustalW (version 1.83) (23).
6. Codon-level multiple alignment of the CDS in each family, using RevTrans (version 1.4) (24).
7. Calculation of the evolutionary indexes. dN and dS were calculated with the Nei and Gojobori (25) and the Yang and Nielsen (26) methods, using yn00 from the PAML (Phylogenetic Analysis by Maximum Likelihood) package (27). The TREx distances were computed with our own implementation of the definition in (19): the fractional identity of silent sites in conserved 2-fold redundant codon sites.
8. Construction of UPGMA (unweighted pair group method with arithmetic mean) trees for grouping the proteins in a paralog family. These trees were derived from the dS matrix, because synonymous substitutions are thought to be approximately neutral molecular markers.
9. Identification of the paralogons using the algorithm developed by McLysaght et al. (16). Paralogons are two genomic segments that share a set of paralogous genes (4,16). After tandem duplications were masked, a greedy search algorithm was used to identify all paralogons between all pairs of chromosomes, based only on gene content and not on gene order (4). Two criteria must be satisfied for a pair of paralogons: (a) they should contain at least two pairs of paralogous genes; and (b) the gap between two neighboring paralogous points on either chromosome should be less than the average length of 30 genes (16).
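The homology criteria and the single linkage clustering step can be sketched as follows. This is a minimal illustration, not the authors' script: the `HSP` record, the function names and the decision to reject a pair outright when HSP order is violated are assumptions made for the example, and genes are treated as loci for simplicity.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    """One high-scoring segment pair (hypothetical minimal record)."""
    q_start: int    # 1-based inclusive coordinates on the query protein
    q_end: int
    s_start: int    # coordinates on the subject (target) protein
    s_end: int
    positives: int  # aligned positions with a positive BLOSUM62 score
    align_len: int  # alignment length of this HSP


def passes_one_direction(hsps, query_len, min_cov=0.8, min_sim=0.5):
    """Check criteria (a)-(c) for one query/subject direction."""
    if not hsps:
        return False
    ordered = sorted(hsps, key=lambda h: h.q_start)
    # (a) HSPs must appear in the same order on the subject as on the query.
    if any(a.s_start >= b.s_start for a, b in zip(ordered, ordered[1:])):
        return False
    # (c) every HSP must have more than 50% similar positions.
    if any(h.positives / h.align_len <= min_sim for h in ordered):
        return False
    # (b) the HSPs together must cover more than 80% of the query length.
    covered = set()
    for h in ordered:
        covered.update(range(h.q_start, h.q_end + 1))
    return len(covered) / query_len > min_cov


def is_homolog(hsps_ab, len_a, hsps_ba, len_b):
    """Criterion (d): the conditions must hold in both directions."""
    return (passes_one_direction(hsps_ab, len_a)
            and passes_one_direction(hsps_ba, len_b))


def single_linkage(genes, homolog_pairs):
    """Group genes into families: any homologous pair joins two clusters."""
    parent = {g: g for g in genes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in homolog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    # Keep only families with at least two members.
    return [sorted(f) for f in families.values() if len(f) >= 2]
```

Single linkage means that genes A and C end up in the same family whenever a chain of pairwise homologs (A–B, B–C) connects them, even if A and C themselves fail the pairwise criteria.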

Content in the database

Large datasets were obtained when the procedure was applied to 26 genomes. We housed the data in a MySQL relational database. The kernel tables in the schema of EPGD are the table of paralog families and the table of paralogons. The peripheral tables, i.e. evolutionary indexes and annotation information, surround these two core tables. A summary of the data in EPGD is shown in Table 1.

Web interface

The web interface was implemented using Java and JavaServer Pages technologies. The user can inspect the datasets in EPGD and see a summary of the current version. The records of paralog families, paralogons and genes (Figure 1) are randomly selected each time the ‘Glance’ page is visited (http://epgd.biosino.org/EPGD/glance.jsp).

Figure 1.

Web pages for gene record (A), paralog family (B) and paralogon region (C). (A) Example of a gene record for H. sapiens. The gene record web page consists of three segments: basic information, paralogon links and coding sequences. Through paralogon links, paralogons ‘including’ or ‘covering’ this gene can be accessed. (B) Example of a paralog family. Gene list, multi-alignment and pre-calculated evolutionary indexes can be obtained from this page. The user can visualize the multi-alignment via Jalview (28). In addition, a UPGMA tree is built and rendered with a Java applet. (C) Paralogon region with a highlighted gene (colored red). Several basic properties (average block length, average block density, number of links) are displayed on the page. In the paralogon figures, the paralogs in these regions are connected with lines. Each gene in these figures is linked to the gene record in the database.

As shown in Figure 1, once a gene record is retrieved, the corresponding paralog family and paralogons can be reached from its page. The main content of the gene page (Figure 1A) starts with basic information on the gene (NCBI gene ID, taxonomy, EPGD family ID, location in the chromosome and a simple description), followed by the EPGD paralogons that include or cover this gene. We define a gene as ‘included’ in a paralogon if it has at least one corresponding paralog in the paralogon region (paralogon-defining gene), and as ‘covered’ by a paralogon if it does not have any corresponding paralog in the paralogon region (paralog-intervening gene). The coding sequences of the gene are listed at the bottom of the page.

The outline of the family page is similar to that of the gene page (Figure 1B). Multi-aligned sequences at the protein or codon level, pre-calculated evolutionary indexes [dN, dS, TREx (19), etc.] and the UPGMA tree based on dS are displayed on this page. The multi-alignments can be viewed in plain text or displayed with the Jalview alignment viewer (28) (Figure 1). On the page that is hyperlinked from ‘Evolution indexes of Pairwise CDSs’, a row with a dN/dS different from the neutral expectation of 1 (z score > 1.96 or z score < −1.96) is color coded orange (Figure 1). The z score is computed following (18) as

$$z = \frac{d_N - d_S}{\sqrt{SE_{d_N}^2 + SE_{d_S}^2 - 2\,\mathrm{Cov}(d_N, d_S)}}$$

where z is the z score, dN is the rate of non-synonymous substitutions, dS is the rate of synonymous substitutions, SE_dN and SE_dS are the standard errors of dN and dS, and Cov(dN, dS) is the covariance of dN and dS. We assume that the non-synonymous and synonymous substitutions are independent and set Cov(dN, dS) to zero (18).

The main part of the paralogon page contains basic information on the paralogon (taxonomy, locations in the chromosomes, average block length, average block density, number of links), followed by an image thumbnail displaying a graphic view of the paralogon. Here, ‘average block density’ is the arithmetic mean of the ratio of paralogon-defining genes to all genes on both sides of the paralogon; ‘number of links’ is the number of unique paralog families linked in the paralogon region. When the mouse hovers over the thumbnail, an enlarged view of the image pops up. Gene names and their regions in the enlarged graphic view of the paralogon are hyperlinked to the gene records in the database.

The user can access the records in EPGD with customized queries (Figure 2). From the ‘iSearch’ webpage (Figure 2A), ‘any text’ and nucleic acid or protein sequences can be searched without setting any parameter. Advanced search pages with numerous input options (Figure 2B and C) can be accessed via the links (‘Advanced Text Search’ or ‘Advanced Sequence Search’) on the ‘iSearch’ page. The sequence search is powered by the NCBI BLAST package (20). Each search returns a result list of records in the database, which provides hyperlinks to detailed pages (Figure 2D).
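The color-coding rule for the pairwise evolutionary indexes can be illustrated with a short function. This sketch assumes the test statistic is the difference dN − dS divided by its standard error, with the covariance term defaulting to zero as the text describes; the function names are hypothetical and are not part of EPGD.

```python
import math


def dn_ds_z_score(dn, ds, se_dn, se_ds, cov=0.0):
    """z score for the deviation of dN from dS.

    Assumed form: (dN - dS) / sqrt(SE_dN^2 + SE_dS^2 - 2 * Cov(dN, dS)).
    cov defaults to 0, i.e. dN and dS are treated as independent.
    """
    variance = se_dn ** 2 + se_ds ** 2 - 2.0 * cov
    if variance <= 0:
        raise ValueError("variance of dN - dS must be positive")
    return (dn - ds) / math.sqrt(variance)


def is_non_neutral(z, threshold=1.96):
    """True when the row would be flagged (|z| > 1.96, two-sided 5% level)."""
    return abs(z) > threshold


# Illustrative values only, not taken from the database.
z = dn_ds_z_score(dn=0.10, ds=0.50, se_dn=0.03, se_ds=0.04)
print(round(z, 2), is_non_neutral(z))
```

A strongly negative z indicates dN well below dS (purifying selection), while a z above 1.96 indicates dN above dS.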

Figure 2.

Database searching. (A) Quick search for ‘any text’ or sequences. (B) Advanced text search. NCBI Gene ID, member ID, paralog family ID, paralogon ID, gene symbol and any word in the gene description can be applied as search fields. (C) Advanced sequence search by NCBI BLAST (20). (D) Query result with a navigation bar.

DATA AVAILABILITY

EPGD data can be downloaded through the ‘DOWNLOADS’ link on the website: a FASTA file containing all proteins, together with family member lists, evolutionary indexes and paralogon regions as plain text files.

RESULTS AND DISCUSSION

The properties of the paralog family spaces in EPGD

Table 1 gives a summary of the content of the current EPGD version. The proportions of duplicated genes in the eukaryotes collected by EPGD range from 9% (Plasmodium falciparum) to 52% (Strongylocentrotus purpuratus), and are smaller than previously reported (e.g. Homo sapiens, 38%; Arabidopsis thaliana, 65%; Drosophila melanogaster, 41%; Caenorhabditis elegans, 49%; Saccharomyces cerevisiae, 30%) (2). This is due to the rigorous criteria for paralog definition used to construct EPGD and because many duplicated genes have lost the characteristic signatures of duplication from their sequences during their evolutionary history (2). Since evolutionary indexes are highly unreliable for ancient gene duplications, rigorous criteria are essential for our database. Paralog families tend to contain fewer than five genes. The distributions of paralog family size in all species of EPGD follow a power law (data not shown) (29,30). As an example, Figure 4A displays the distribution of paralog family sizes in H. sapiens and the corresponding log–log diagram. The power law distribution indicates the robustness of our family detection method and the quality of gene prediction in the original data (29).
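A power law in family size appears as a straight line on a log–log plot, so the claim above can be checked with an ordinary least squares fit on the logarithms. The sketch below is one way to do that; the function name `loglog_fit` is hypothetical and the counts are made-up illustrative numbers, not EPGD data.

```python
import math


def loglog_fit(sizes, counts):
    """Least squares fit of log10(count) against log10(size).

    Returns (slope, intercept, r). For a power law count ~ size^(-k),
    the slope estimates -k and r is close to -1.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Made-up example: number of families observed at each family size.
sizes = [2, 3, 4, 5, 6, 8, 10]
counts = [2000, 600, 250, 130, 75, 30, 16]
slope, intercept, r = loglog_fit(sizes, counts)
print(f"slope={slope:.2f}, r={r:.3f}")
```

Family sizes with zero observed families have to be dropped before taking logarithms, which is one reason the tail of such a distribution is often truncated in plots.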

Figure 4.

Statistics of the paralog families in H. sapiens. (A) Frequency distribution of the sizes of the paralog families and the corresponding log–log diagram. Note that families with more than 17 gene members were omitted from this plot and that the largest family is the olfactory receptor family, which possesses 377 genes. In the log–log diagram, the logarithms of these two variables fit the linear model (r = −0.8191, P = 1.013 × 10^−9). (B) Negative correlation between TREx distance and dS. The points with dS < 1.0 were used in this panel. The correlation coefficient of these two variables is −0.8916967 (P < 2.2 × 10^−16). The line generated with a least squares fit has a slope of −0.3051738. (C) dN as a function of dS. Data points are divided into two groups: black points denote gene pairs for which the ratio dN/dS is not significantly different from the neutral expectation of 1 (−1.96 < z score < 1.96) and green points denote gene pairs whose dN/dS is different from the neutral expectation of 1 (z score > 1.96 or z score < −1.96). The dashed line denotes dN = dS. (D) Frequency distribution of the sizes of paralogons, which are defined as the number of linked families in the region.

Consistent with previous studies on Bacteria and a small set of Eukarya (9,29,31), large genomes possess more paralog families and a higher proportion of genes belonging to paralog families than small genomes (Figure 3A and C). We find, however, only a weak correlation between the average family size and genome size (Figure 3B, r = 0.26, P = 0.19), in contrast to the finding in Bacteria that average family size increases with genome size (31). This result suggests that the higher percentage of paralogs in large eukaryotic genomes stems mainly from the emergence of new paralog families; an expansion of existing gene families is not evident in Eukarya (Figure 3B).
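The correlation between genome size and average family size can be recomputed from Table 1. The sketch below assumes that genome size is measured as the number of genes (as in the Figure 3 caption) and that a plain Pearson coefficient is the relevant statistic; the variable and function names are hypothetical, and whether the result matches the reported value exactly depends on the authors' method.

```python
import math

# Gene counts and average family sizes, copied row by row from Table 1.
GENES = [
    5365, 5504, 6617, 9430, 7081, 5534, 7180, 5374, 2029, 10157,
    13748, 28806, 38631, 6198, 14838, 29122, 31482, 9837, 19828,
    20053, 21052, 33610, 41323, 35786, 32025, 30552,
]
FAMILY_SIZE = [
    5.4889, 2.6165, 2.9206, 3.2965, 2.9701, 2.4868, 3.5918, 2.6987,
    3.5862, 3.121, 3.8389, 4.0544, 3.5326, 2.6829, 4.0492, 3.603,
    3.6764, 4.2532, 3.3447, 3.6296, 4.9046, 3.182, 4.3044, 3.8261,
    4.3379, 2.7887,
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(GENES, FAMILY_SIZE)
print(f"r = {r:.2f} over {len(GENES)} genomes")
```

The printed coefficient can be compared with the value reported for Figure 3B; swapping in the Families or Paralogons column gives the analogous checks for Figure 3A and D.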

Figure 3.

Number of families (A), average size of families (B), ratio of paralogs (C) and number of paralogons (D) in different genomes. Number of genes denotes the size of a genome, r is the correlation coefficient and P is the P-value.

The number of paralogons increases with genome size (Figure 3D, r = 0.86, P = 3.356 × 10^−8), indicating the effect of the duplication of large genome segments on the evolution of genome size. Furthermore, the distribution of paralogon size is also skewed (e.g. Figure 4D). Most paralogons have fewer than five linked families (98% of all human paralogons), because of the high level of gene loss after duplication, as well as chromosomal rearrangements and recombination. Still, the identification of putative paralogons provides many insights into evolutionary mechanisms (4).

The example of H. sapiens

Taking H. sapiens as an example (Figure 4), we plotted the distribution of paralog family size (Figure 4A), a scatter diagram of TREx distance versus dS (Figure 4B), a log–log graph of dN versus dS (Figure 4C) and the distribution of paralogon size (the number of linked families) (Figure 4D). Transition redundant exchange (TREx) processes at conserved 2-fold redundant codon sites are thought to offer an approximation of a neutral molecular clock (19). We calculated the TREx distances for each paralog family, which provide a more homogeneous molecular clock than that provided by dS. If the time since two genes diverged is long relative to the reciprocal of the rate constant with which these silent sites undergo transition substitutions, the TREx distance approaches 0.5. As seen in Figure 4B, TREx distances are negatively correlated with dS (r = −0.89, P < 2.2 × 10^−16). Therefore, the TREx distance can be used as an alternative to dS. Similar to the work of Lynch et al. (32), dN was plotted as a function of dS (Figure 4C). The accumulation of non-neutral points as dS increases (Figure 4C) confirms the gradual increase of selective constraint on duplicates during evolutionary history (32). When dS is greater than 2, there are more points around the neutral expectation (Figure 4C). This is an artifact resulting from saturation effects in the estimation of dN and dS (33).
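The TREx distance described above, the fractional identity of silent sites at conserved 2-fold redundant codon sites, can be sketched for one pair of codon-aligned coding sequences. This is a minimal illustration and not the authors' implementation: it assumes the standard genetic code, counts only the nine amino acids encoded by exactly two codons, and skips gap codons; the function name `trex_f2` is hypothetical.

```python
# Codons of the nine amino acids that have exactly two codons, which
# differ by a transition at the third position (standard genetic code).
TWO_FOLD = {
    "TTT": "F", "TTC": "F", "TAT": "Y", "TAC": "Y",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C",
}


def trex_f2(cds_a, cds_b):
    """Fraction of conserved 2-fold sites with an identical third base.

    cds_a and cds_b must be codon-aligned and of equal length.
    Returns None when the pair has no conserved 2-fold redundant site.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be codon-aligned and equal length")
    conserved = identical = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        amino = TWO_FOLD.get(codon_a)
        # Keep only sites where both codons encode the same 2-fold amino acid.
        if amino is None or TWO_FOLD.get(codon_b) != amino:
            continue
        conserved += 1
        if codon_a[2] == codon_b[2]:
            identical += 1
    return identical / conserved if conserved else None


# Three conserved 2-fold sites (F, K, D); only the K codon is identical.
print(trex_f2("TTTAAAGATGGG", "TTCAAAGACGGA"))
```

For two identical sequences the value is 1; as silent transitions accumulate it decays toward the equilibrium level, which the text gives as about 0.5.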

PERSPECTIVES

We plan to update EPGD every six months. As new eukaryotic organisms are fully sequenced and annotated, they will be added to EPGD using our procedure. In the future, ortholog annotation information will also be included. However, the development of the utilities for EPGD will still focus on tools for the analysis of duplication events, such as statistical tests of the paralogons (unpublished data) and chromosome ideograms. Furthermore, we will thoroughly analyze the data in EPGD and present insights into the effect of duplication events on genome evolution. The procedure to build the EPGD is currently semi-automatic. We will make the procedure totally automatic and start an open source project in the future.

30 in total

1. Lineage-specific gene expansions in bacterial and archaeal genomes.

Authors: I K Jordan; K S Makarova; J L Spouge; Y I Wolf; E V Koonin
Journal: Genome Res Date: 2001-04 Impact factor: 9.043

2. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models.

Authors: Z Yang; R Nielsen
Journal: Mol Biol Evol Date: 2000-01 Impact factor: 16.240

3. The evolutionary fate and consequences of duplicate genes.

Authors: M Lynch; J S Conery
Journal: Science Date: 2000-11-10 Impact factor: 47.728

Review 4. Are we polyploids? A brief history of one hypothesis.

Authors: W Makalowski
Journal: Genome Res Date: 2001-05 Impact factor: 9.043

Review 5. Interpretive proteomics--finding biological meaning in genome and proteome databases.

Authors: Steven A Benner
Journal: Adv Enzyme Regul Date: 2003

6. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate.

Authors: A Wagner
Journal: Proc Natl Acad Sci U S A Date: 2000-06-06 Impact factor: 11.205

7. Extensive genomic duplication during early chordate evolution.

Authors: Aoife McLysaght; Karsten Hokamp; Kenneth H Wolfe
Journal: Nat Genet Date: 2002-05-28 Impact factor: 38.330

8. Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey.

Authors: Jianzhi Zhang; Ya-ping Zhang; Helene F Rosenberg
Journal: Nat Genet Date: 2002-03-04 Impact factor: 38.330

9. ParaDB: a tool for paralogy mapping in vertebrate genomes.

Authors: Magalie Leveugle; Karine Prat; Nadine Perrier; Daniel Birnbaum; François Coulier
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

10. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Lewis Y Geer; Yuri Kapustin; Oleg Khovayko; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Vadim Miller; Kim D Pruitt; Gregory D Schuler; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Roman L Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2006-12-14 Impact factor: 16.971
