Christine Vogel, Cyrus Chothia.
Abstract
During the course of evolution, new proteins are produced very largely as the result of gene duplication, divergence and, in many cases, combination. This means that proteins or protein domains belong to families or, in cases where their relationships can only be recognised on the basis of structure, superfamilies whose members descended from a common ancestor. The size of superfamilies can vary greatly. Also, during the course of evolution organisms of increasing complexity have arisen. In this paper we determine the identity of those superfamilies whose relative sizes in different organisms are highly correlated to the complexity of the organisms. As a measure of the complexity of 38 uni- and multicellular eukaryotes we took the number of different cell types of which they are composed. Of 1,219 superfamilies, there are 194 whose sizes in the 38 organisms are strongly correlated with the number of cell types in the organisms. We give outline descriptions of these superfamilies. Half are involved in extracellular processes or regulation and smaller proportions in other types of activity. Half of all superfamilies have no significant correlation with complexity. We also determined whether the expansions of large superfamilies correlate with each other. We found three large clusters of correlated expansions: one involves expansions in both vertebrates and plants, one just in vertebrates, and one just in plants. Our work identifies important protein families and provides one explanation of the discrepancy between the total number of genes and the apparent physiological complexity of eukaryotic organisms.
Year: 2006 PMID: 16733546 PMCID: PMC1464810 DOI: 10.1371/journal.pcbi.0020048
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Motivation and Outline of the Analysis
(A) The number of genes and eukaryotic complexity are uncorrelated. The figure displays for 38 eukaryotic genomes the estimated number of different cell types [28,29] in relation to the predicted total number of genes. The tree indicates, in a simplified form, the phylogenetic relationships between the organisms as taken from the National Center for Biotechnology Information (NCBI) taxonomy server (http://www.ncbi.nlm.nih.gov/Taxonomy). The order of the organisms is the same in all figures and tables; their major groups are: plants (green), protozoa (blue), fungi (black), and animals (red and brown). The correlation between the number of different cell types and the number of genes is poor (R² = 0.29, R = 0.54).
Within the plants, we distinguish a green alga (Cre, Chlamydomonas reinhardtii) and two flowering plants (Osa, O. sativa; Ath, Arabidopsis thaliana). We include eight protozoa (Ddi, Dictyostelium discoideum; Tbr, Trypanosoma brucei; Lma, Leishmania major; Pra, Phytophthora ramorum; Tps, Thalassiosira pseudonana; Ehi, Entamoeba histolytica; Tan, Theileria annulata; Pfa, Plasmodium falciparum), and ten fungi (Ncr, Neurospora crassa; Eni, Emericella nidulans; Spo, Schizosaccharomyces pombe; Sce, S. cerevisiae; Kla, Kluyveromyces lactis; Cal, Candida albicans; Yli, Yarrowia lipolytica; Ecu, Encephalitozoon cuniculi; Pch, Phanerochaete chrysosporium; Uma, Ustilago maydis). Protostomia include two nematodes (Cbr, Caenorhabditis briggsae; Cel, C. elegans), and three insects (Ame, Apis mellifera; Aga, Anopheles gambiae; Dme, D. melanogaster). Deuterostomia include one urochordate (Cin, Ciona intestinalis), and 11 vertebrates: five non-mammals (Dre, Danio rerio; Tni, Tetraodon nigroviridis; Tru, Takifugu rubripes; Xtr, Xenopus tropicalis; Gga, Gallus gallus) and six mammals (Cfa, Canis familiaris; Bta, Bos taurus; Rno, Rattus norvegicus; Mmu, Mus musculus; Ptr, Pan troglodytes; Hsa, H. sapiens).
(B) Outline of our analysis. For each of the 38 genomes (three, symbolised by circles), we collected information on the number of proteins (lines with boxes) that contain domains of particular superfamilies (boxes of particular colour). The resulting abundance profiles were normalised and compared both to the estimated number of different cell types in each organism, and to each other. Analysis of function of particular groups of domain superfamilies gives information on how their expansion in some organisms may have supported an increase in organismal complexity.
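The normalisation step in (B) can be illustrated with a short sketch. The caption does not state which normalisation was applied, so the z-score used here is an assumption, and the function name `normalise_profile` is hypothetical.

```python
from math import sqrt


def normalise_profile(counts):
    """Rescale one superfamily's abundance profile to zero mean and unit
    (population) standard deviation.

    ASSUMPTION: z-score normalisation is used for illustration only; the
    source does not specify the normalisation method.
    `counts` holds the number of proteins containing the domain, one entry
    per genome, in a fixed genome order.
    """
    n = len(counts)
    if n == 0:
        raise ValueError("profile must contain at least one genome")
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / n
    if variance == 0:
        # A flat profile carries no relative-abundance signal.
        return [0.0] * n
    sd = sqrt(variance)
    return [(c - mean) / sd for c in counts]
```

Applied to every superfamily, this puts small and large superfamilies on the same scale, so that their patterns across genomes, rather than their absolute sizes, are compared.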
Few Domain Superfamilies Correlate Well with the Number of Different Cell Types
Figure 2. Some Family Expansions Correlate Well with the Number of Different Cell Types in Each Organism
For each of the 1,219 domain superfamilies and their profile of abundance in the 38 genomes, we calculated the correlation coefficient R of the profile with the number of different cell types per organism. The distribution of R values is plotted in black. For the subset of largest superfamilies (i.e., those with at least 25 proteins in one of the genomes) the distribution of R values is shown in red. There are few superfamilies with high correlation (R ≥ 0.80), and many with poor correlation or slight anticorrelation (R ≤ 0.20); this distribution is similar for both sets of superfamilies.
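The calculation described in this caption can be sketched as follows. This assumes R is the Pearson correlation coefficient, which the caption does not state explicitly. The function names and the example data are hypothetical; only the thresholds 0.80 and 0.20 come from the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # R is undefined for a constant sequence; returning 0.0 here is an
        # illustrative choice, not something taken from the source.
        return 0.0
    return sxy / sqrt(sxx * syy)


def classify(r, good=0.80, poor=0.20):
    """Bin an R value using the thresholds quoted in the caption."""
    if r >= good:
        return "good"
    if r <= poor:
        return "poor"
    return "intermediate"


# Made-up example: five genomes, not data from the paper.
cell_types = [1, 3, 10, 50, 170]
expanding_family = [2, 5, 20, 90, 300]
flat_family = [40, 38, 41, 39, 40]

for name, profile in [("expanding", expanding_family), ("flat", flat_family)]:
    r = pearson_r(cell_types, profile)
    print(name, round(r, 2), classify(r))
```

Repeating this for each superfamily profile yields the distribution of R values that the figure plots.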
Contribution of Different Groups of Domain Superfamilies to the Overall Composition of Genomes
Figure 3. Examples of Family Expansions with Good or Poor Correlation with the Number of Different Cell Types
There are 194 superfamilies with good (R ≥ 0.80; [A]) and 555 superfamilies with poor or negative (R ≤ 0.20; [B]) correlation with the number of different cell types, and the diagrams show 15 examples of each. Some of the peaks are annotated in italics. The genomes are in the same order as in Figure 1A. The lines between counts of domain abundance are for better visualisation only. Abbreviations are as in Figure 1.
Domain Families with Good Correlation with the Number of Different Cell Types
Figure 4. Domain Superfamilies Show Different Expansion Patterns
The matrix shows the 299 largest domain superfamilies, i.e., those that occur in ≥25 proteins in at least one of the genomes, hierarchically clustered. Each row represents one superfamily. Colour-coded profiles show the normalised abundance of each domain superfamily across the different eukaryotic genomes: white, low relative abundance; blue, high relative abundance. Each column represents one genome. All genomes are abbreviated and organised as in Figure 1A. A grouping of superfamily pairs with R ≥ 0.90 results in 26 clusters, and the three largest clusters are indicated by red boxes: expansions in vertebrates (52 superfamilies), expansions in plants (33 superfamilies), and expansions in both vertebrates and plants (26 superfamilies). Further descriptions can be found in Table 4 and at http://polaris.icmb.utexas.edu/people/cvogel/HV.
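One way to group superfamily pairs with R ≥ 0.90 is sketched below. The caption does not specify the linkage rule, so treating clusters as connected components of the "R ≥ 0.90" graph (single linkage) is an assumption. The function names and the toy profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def cluster_profiles(profiles, threshold=0.90):
    """Group superfamilies whose profiles are linked by R >= threshold.

    ASSUMPTION: two superfamilies share a cluster if a chain of pairwise
    correlations >= threshold connects them (connected components).
    `profiles` maps a superfamily name to its abundance profile.
    Returns a list of clusters, largest first.
    """
    parent = {name: name for name in profiles}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if pearson_r(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy profiles over four genomes (made up, not from the paper).
toy = {
    "sf_a": [1, 2, 3, 4],
    "sf_b": [2, 4, 6, 8.5],
    "sf_c": [4, 3, 2, 1],
    "sf_d": [8, 6, 4, 2.5],
}
print(cluster_profiles(toy))
```

A stricter rule, such as requiring every pair within a cluster to reach the threshold, would generally produce more and smaller clusters.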
Patterns of Domain Superfamily Expansions