Transposases are the most abundant, most ubiquitous genes in nature.

Ramy K Aziz, Mya Breitbart, Robert A Edwards.

Abstract

Genes, like organisms, struggle for existence, and the most successful genes persist and widely disseminate in nature. The unbiased determination of the most successful genes requires access to sequence data from a wide range of phylogenetic taxa and ecosystems, which has finally become achievable thanks to the deluge of genomic and metagenomic sequences. Here, we analyzed 10 million protein-encoding genes and gene tags in sequenced bacterial, archaeal, eukaryotic and viral genomes and metagenomes, and our analysis demonstrates that genes encoding transposases are the most prevalent genes in nature. The finding that these genes, classically considered as selfish genes, outnumber essential or housekeeping genes suggests that they offer selective advantage to the genomes and ecosystems they inhabit, a hypothesis in agreement with an emerging body of literature. Their mobile nature not only promotes dissemination of transposable elements within and between genomes but also leads to mutations and rearrangements that can accelerate biological diversification and--consequently--evolution. By securing their own replication and dissemination, transposases guarantee to thrive so long as nucleic acid-based life forms exist.

Year:  2010        PMID: 20215432      PMCID: PMC2910039          DOI: 10.1093/nar/gkq140

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Since life first emerged, organisms have been struggling for survival and competing over the finite resources within their ecosystems (1,2). This struggle for survival not only is confined to the organism level but it also applies to individual genes (3) and even non-coding DNA segments (4,5). As a corollary, a gene’s success can be determined by its ability to persist in nature and to be spread throughout genomes and biomes (6). For this to take place, genes need some sequence plasticity to adapt to different environments while retaining enough sequence conservation to preserve the structure of their encoded proteins and the identity of their encoded biological functions (7,8). Every time a new genome is sequenced, many genes are identified and annotated based on their homology to sequences available in databases, but new genes with novel functions are also identified, adding to the universal gene pool. To date, no study has systematically and directly surveyed the millions of protein-encoding genes (PEGs) deposited in sequence databases to identify their relative prevalence. There have been several challenges to such an endeavor: (i) the absence of numerical parameters to assess a gene’s prevalence; (ii) the lack of fair representation of the tree of life within available sequence data (9,10) and (iii) the difficulty of defining what is meant by ‘same gene’ in different organisms and ecosystems. To overcome these difficulties, (i) we calculated both the abundance and ubiquity of all known biological functions encoded in genomes and ecosystems to estimate their prevalence, with the assumption that these values will be correlated with gene fitness; (ii) we surveyed both genomic and metagenomic data sets to reduce bias caused by the uneven sampling of the tree of life in genomic data sets; and (iii) we defined similar genes as those encoding proteins with similar specific biological functions. 
In some instances, this definition could be regarded as an oversimplification, notably in cases of convergent evolution or homoplasy, where multiple genes of different origin evolve to perform similar biological functions. However, the majority of current gene annotations are specific enough to distinguish many instances of paralogous genes or different classes within gene/protein families. It is also understandable that different genes are under different selection pressures, as some are forced to endure mutations and tolerate sequence variability to escape pressure (e.g. bacterial genes encoding immunogenic proteins that are under pressure of the host immune system and genes encoding surface proteins that are easily recognized by predators) while others are under strict sequence conservation pressure (e.g. genes encoding housekeeping enzymes and essential biological functions). Importantly, in determining gene prevalence we distinguished between ubiquity and abundance. Ubiquity is one of the indicators of essentiality, while abundance without ubiquity is an indicator of adaptive, organism-specific or habitat-specific functionality. In other words, ubiquitous genes are assumed to be those that carry essential functions and are thus indispensable in every genome (elements of core genomes) or every ecosystem (eco-essential genes). On the other hand, genes that are overly abundant in few ecosystems and absent in others are likely to play essential habitat-specific roles (e.g. photosynthesis, anaerobic metabolism, detoxification, etc.). Contenders for the ‘fittest gene’ title include the gene encoding ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO), an enzyme that plays a critical role in the fixation of carbon dioxide via the Calvin cycle and that has been touted as the single most successful, most abundant enzyme on the planet (11). Genes encoding ribosomal proteins are also plausible candidates. 
However, those are largely limited to cellular life forms, are not essential to viruses and are almost absent from them (12), and are divergent between eukaryotes and prokaryotes. Additionally, DNA polymerase genes and other genes involved in DNA synthesis and nucleotide/nucleoside metabolism (e.g. ribonucleotide reductases, RNR) are essential for DNA-based life and are not restricted to cellular organisms, being found in viral genomes as well. Their essentiality favors them as strong competitors; yet, they are often present in only one or a few copies per genome. To our surprise, none of these candidate genes topped the list of the most abundant, most ubiquitous genes. Instead, our analysis singled out genes encoding transposases as the most abundant genes in genomes and metagenomes, and the most ubiquitous in metagenomes.

ANALYSIS OF GENOMES AND METAGENOMES

To determine the most abundant non-hypothetical PEG, we examined almost 10 million annotated genes or gene tags: over 3.2 million PEGs in fully sequenced viral, bacterial, archaeal and eukaryotic genomes (2137 genomes on 1 May 2009) and over 6.7 million environmental gene tags (EGT)—with significant matches to known proteins—in 187 random community genomes (metagenomes). For functional assignments, we mostly relied on the annotations available in the SEED database (13) because it uses subsystems-based controlled vocabulary curated by human experts and automatically propagated among genomes (14). For consistency, the same SEED subsystems were used for the annotation of all metagenomic data sets described in this study (15).

Analysis of complete genome sequences

We screened 2137 complete genomes (47 archaeal, 725 bacterial, 29 eukaryotic and 1336 viral genomes at the time this study was performed) available in the SEED database (URL: http://seed-viewer.theseed.org) and identified 37 258 PEGs (1.163% of all PEGs) annotated as transposase-related. Out of these, 26 625 (0.825% of all PEGs) were explicitly annotated as ‘transposases’, 360 were annotated as ‘degenerate transposases’, and the remainder were a variety of insertion sequence-related transposases, which may or may not be functional. Even when these ambiguous annotations were excluded from the final counts, transposases remained the most abundant PEGs in the completely sequenced genomes (Figure 1).
Figure 1.

Abundance of different functional roles in 2137 genomes plotted against the ubiquity of these functional roles (defined as the number of genomes in which the functional role is represented at least once). r, Pearson’s product moment correlation between abundance and ubiquity; Cys, cysteine; Thio, thioredoxin; ThioR, thioredoxin reductase. Proteins annotated solely based on their location or posttranslational modification but not their biological functions (e.g. membrane proteins, cytoplasmic proteins, secreted proteins, transmembrane proteins and generic lipoproteins) were excluded; an exception was the ‘outer membrane protein’ annotation as it describes specific bacterial proteins rather than protein localization.
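The two quantities plotted in Figure 1 can be tallied as in the following minimal sketch. The input layout (a mapping from a genome identifier to the list of functional-role annotations of its PEGs) and the toy genome names are assumptions made for illustration; they are not the SEED data model or the authors' pipeline.

```python
from collections import Counter


def abundance_and_ubiquity(annotations):
    """Tally abundance and ubiquity for each functional role.

    annotations: dict mapping a genome id to a list of functional-role
    labels, one label per annotated gene (hypothetical input layout).
    Returns two Counters keyed by functional role:
      abundance = total number of genes carrying the role
      ubiquity  = number of genomes with the role at least once
    """
    abundance = Counter()
    ubiquity = Counter()
    for roles in annotations.values():
        abundance.update(roles)       # every annotated gene counts
        ubiquity.update(set(roles))   # each genome counts at most once
    return abundance, ubiquity


# Toy data, invented for illustration only.
toy = {
    "genome_A": ["Transposase"] * 4 + ["Valyl-tRNA synthetase"],
    "genome_B": ["Valyl-tRNA synthetase", "Integrase"],
    "genome_C": ["Valyl-tRNA synthetase"],
}
abundance, ubiquity = abundance_and_ubiquity(toy)
```

In this toy set the transposase role is the most abundant (four genes) but is found in a single genome, whereas the tRNA synthetase role is present once in every genome. That is the abundant-versus-ubiquitous distinction the figure is meant to show.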

These data imply that out of a set of 2000 randomly sampled genes (the average number of genes in a typical bacterial genome), 22 genes are expected to encode transposases, at least 16 of which are likely functional. Obviously, genomes that have transposase genes tend to have them in multiple copies; this explains why although two-thirds of sequenced genomes (mostly viral) lack known functional transposases, the average number of transposases—when present—is 38.42 per genome (Table 1 and Supplementary Table S1). This observation is also in agreement with reports that transposases are unequally distributed among bacterial genomes, with higher abundance in facultative pathogens and free-living bacteria than in obligate pathogens and endosymbionts (16), and with extraordinarily high numbers in some species, e.g. Crocosphaera watsonii (17,18).
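The per-genome average quoted above, and the percentage column of Table 1, follow from the raw counts by simple division. The sketch below recomputes both for the transposase row, using the counts given in the text and the total of 3 204 918 genes stated in the table legend. The function names are illustrative and not taken from the study.

```python
TOTAL_GENES = 3_204_918  # n, total genes in all genomes (Table 1 legend)


def copies_per_positive_genome(gene_count, n_positive_genomes):
    """C/n: average number of genes per genome that has the role."""
    return gene_count / n_positive_genomes


def percent_of_all_genes(gene_count, total_genes=TOTAL_GENES):
    """%: share of the role among all genes in all genomes."""
    return 100 * gene_count / total_genes


# Transposase row: 26 625 genes found in 693 genomes.
transposase_c_per_n = round(copies_per_positive_genome(26_625, 693), 2)
transposase_percent = round(percent_of_all_genes(26_625), 2)
```

Rounded to two decimals, these give 38.42 copies per positive genome and 0.83% of all genes, matching the table.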
Table 1.

The 20 most abundant non-hypothetical protein-encoding genes in all sequenced genomes

For V, A, B and E, each cell gives: number of genomes with the role (% of genomes in that domain) / number of genes. A dash marks an empty cell.

Rank | Functional role | nG | Count | V | A | B | E | C/n | %
1 | Transposase | 693 | 26 625 | 15 (1.1%) / 21 | 31 (66%) / 736 | 630 (86.9%) / 25 226 | 17 (58.6%) / 642 | 38.42 | 0.83
2 | ABC transporter, ATP-binding protein | 738 | 9382 | 1 (<1%) / 1 | 39 (83%) / 264 | 682 (94.1%) / 8998 | 16 (55.2%) / 119 | 12.71 | 0.29
3 | Sensor histidine kinase | 574 | 5575 | - | 22 (46.8%) / 294 | 550 (75.9%) / 5276 | 2 (6.9%) / 5 | 9.71 | 0.17
4 | DNA-binding response regulator | 578 | 4708 | - | 13 (27.7%) / 33 | 562 (77.5%) / 4669 | 3 (10.3%) / 6 | 8.20 | 0.15
5 | Methyl-accepting chemotaxis protein | 408 | 4389 | 1 (<1%) / 1 | 15 (31.9%) / 64 | 391 (53.9%) / 4318 | 1 (3.4%) / 6 | 10.76 | 0.14
6 | ABC transporter, permease protein | 580 | 4377 | - | 33 (70.2%) / 137 | 545 (75.2%) / 4238 | 2 (6.9%) / 2 | 7.55 | 0.14
7 | Glycosyltransferase (EC 2.4.1.-) | 649 | 4172 | - | 41 (87.2%) / 287 | 598 (82.5%) / 3863 | 10 (34.5%) / 22 | 6.43 | 0.13
8 | Transcriptional regulator, LysR family | 441 | 4037 | - | 8 (17%) / 10 | 430 (59.3%) / 4017 | 3 (10.3%) / 10 | 9.15 | 0.13
9 | Transcriptional regulator, TetR family | 535 | 3709 | - | 14 (29.8%) / 45 | 521 (71.9%) / 3664 | - | 6.93 | 0.12
10 | Acetyltransferase, GNAT family | 480 | 3516 | - | 19 (40.4%) / 53 | 458 (63.2%) / 3453 | 3 (10.3%) / 10 | 7.33 | 0.11
11 | Transcriptional regulator, AraC family | 459 | 3382 | 1 (<1%) / 1 | 7 (14.9%) / 7 | 450 (62.1%) / 3373 | 1 (3.4%) / 1 | 7.37 | 0.11
12 | Long-chain-fatty-acid–CoA ligase (EC 6.2.1.3) | 589 | 2995 | - | 31 (66%) / 68 | 533 (73.5%) / 2728 | 25 (86.2%) / 199 | 5.08 | 0.09
13 | Transcriptional regulator, MarR family | 546 | 2905 | - | 23 (48.9%) / 71 | 522 (72%) / 2831 | 1 (3.4%) / 3 | 5.32 | 0.09
14 | Permeases of the major facilitator superfamily | 393 | 2733 | 1 (<1%) / 1 | 12 (25.5%) / 22 | 375 (51.7%) / 2701 | 5 (17.2%) / 9 | 6.95 | 0.09
15 | Acetyltransferase (EC 2.3.1.-) | 559 | 2436 | 4a (<1%) / 4 | 22 (46.8%) / 57 | 532 (73.4%) / 2374 | 5 (17.2%) / 5 | 4.36 | 0.08
16 | Cysteine desulfurase (EC 2.8.1.7) | 783 | 2362 | - | 36 (76.6%) / 66 | 722 (99.6%) / 2239 | 25 (86.2%) / 57 | 3.02 | 0.07
17 | 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) | 706 | 1975 | - | 27 (57.4%) / 68 | 665 (91.7%) / 1863 | 14 (48.3%) / 44 | 2.80 | 0.06
18 | Integrase | 534 | 1829 | 70 (5.2%) / 70 | 11 (23.4%) / 19 | 448 (61.8%) / 1729 | 5 (17.2%) / 11 | 3.43 | 0.06
19 | Outer membrane protein | 415 | 1803 | 1 (<1%) / 1 | 10 (21.3%) / 12 | 402 (55.4%) / 1786 | 2 (6.9%) / 4 | 4.34 | 0.06
20 | Permease of the drug/metabolite transporter (DMT) superfamily | 518 | 1746 | - | 28 (59.6%) / 53 | 486 (67%) / 1688 | 4 (13.8%) / 5 | 3.37 | 0.05

nG: number of genomes in which the functional role is present at least once; Count: number of genes in all sequenced genomes; V, A, B, E: viruses, archaea, bacteria, eukarya, respectively; C/n: average number of genes per positive genome; %: percentage of genes to the total number of genes in all genomes (n = 3 204 918).

aAcetyltransferase-like proteins that were missed in the automated analysis.

While the abundance of transposase genes in microbial genomes has been recognized for a long time, only recently has it been exploited for inferring microbial cohabitation patterns and lateral gene transfer (19). Next to transposases, the most abundant functional roles in all sequenced genomes include ABC transporters, transcriptional regulators of different families, signal transduction kinases, chemotaxis proteins, acetyl- and glycosyl- transferases and cysteine desulfurase (Table 1 and Supplementary Table S1). On the other hand, the most ubiquitous functional roles in sequenced genomes are encoded by low-copy-number genes that consequently have a low overall abundance. Only four out of the 100 most ubiquitous functional genes have a mean copy number >2 per genome. These are genes encoding thioredoxin reductase; thioredoxin; cysteine desulfurase and the ABC transporter, ATP-binding protein (Supplementary Table S2). The list of most ubiquitous functional roles in genomes was topped by tRNA synthetases (Figure 1), and other genes associated with protein synthesis and post-translational protein sorting (e.g. translation elongation factor and preprotein translocase, Supplementary Table S2).

Analysis of metagenomic sequences

In spite of the striking prevalence and high copy numbers of transposase genes in fully sequenced genomes, the use of those data sets is prone to biases. The available genomes unevenly represent the tree of life as they mostly correspond to cultured organisms from just four bacterial phyla (9). Moreover, there is an over-representation of microbes of interest to humans (20), such as bacterial pathogens and microbes used in agriculture or industry (21). Finally, while viruses are at least 10 times as abundant as bacteria in nature (22,23), sequenced viral genomes are lagging behind both in terms of numbers (∼2:1 viral to bacterial genome ratio) and annotation quality (most encoded proteins are of unknown functions). In contrast, analysis of community genomes (metagenomes) offers a less-biased representation of life forms and biological functions in various habitats. The term ‘metagenome’ describes the collective genomes found in a particular ecosystem (24,25). Since the first uncultured viral community genomic sequences were published in 2002 (26), metagenomics has emerged as a rapid and efficient method of identifying not only the species present in a given ecosystem but also the ecosystem-associated metabolic signatures or patterns (27–31). The emergence of low-cost, high-throughput next-generation sequencing technologies (32–37) has enabled the quick implementation of metagenomics in the analysis of different environments, allowing an unprecedented view of biodiversity (25,38–42). Over the past few years, metagenomic sequencing has been used to explore a wide range of environments, encompassing various marine ecosystems (28,43–47), hydrothermal vents (48,49), corals (50–52), salterns (53,54), soil (55–57), sludge (58), mines (59), human and animal guts (60–64) and lungs (65), microbialites (66,67) and even mosquitoes (31). 
Metagenomic analysis is shifting the paradigm from organism/genome-centric to gene-centric and pathway-centric approaches to understanding biodiversity (68,69). Several bioinformatics and statistical tools allow the metabolic reconstruction of a particular ecosystem by enumerating EGTs in metagenomes and binning them either phylogenetically or biochemically (15,68,70–73), as well as the comparison of multiple metagenomes (59,74,75). In this study, we followed a gene-centric approach by enumerating EGTs, and estimating the abundance and ubiquity of their different functional roles in 187 different metagenomic samples representing a broad range of environments. Assessing EGT abundance in metagenomic data is slightly different from determining PEG frequency in fully sequenced genomes. In genomes, a single, full-length copy of a gene reflects a single occurrence of that gene in one cell of an organism. In metagenomic data, multiple occurrence of an EGT can be attributed to either multiple copies of the same gene, multiple orthologs (from different genomes), multiple paralogs or just multiple sequences covering different parts of the exact same DNA segment. Moreover, the coding sequence length is a potential confounding factor: longer genes are more likely to be sampled by random sequencing (unless the sample is large enough to provide 100% coverage). For those reasons, the frequency of each EGT was normalized to the mean length of the most similar proteins [from BLASTX (76) results] to generate an abundance index, which was further divided by the number of informative sequence reads (those sequence tags matching annotated proteins in known databases) to generate a normalized abundance index (see the legend of Table 2 for more details).
Table 2.

The 20 most abundant functional roles in metagenomes

Rank | Functional role | nMG | nCAI
1 | Transposase | 178 | 4026.17
2 | Retrotransposon-related p150 protein | 69 | 3412.12
3 | Viral structural protein | 126 | 1909.75
4 | ABC transporter, ATP-binding protein | 170 | 1528.03
5 | Replication-associated protein | 32 | 1481.67
6 | Photosystem II CP43 protein (PsbC) | 47 | 1429.44
7 | Photosystem II protein D2 (PsbD) | 71 | 1224.89
8 | Replication protein Rep | 66 | 1213.18
9 | Photosystem II protein D1 (PsbA) | 83 | 930.2
10 | Cytochrome b6-f complex subunit, cytochrome b6 | 51 | 925.32
11 | Viral nonstructural protein | 39 | 847.57
12 | ATP synthase alpha chain (EC 3.6.3.14) | 157 | 804.47
13 | Ribonucleotide reductase of class Ia (aerobic), alpha subunit (EC 1.17.4.1) | 165 | 776.57
14 | Thymidylate synthase thyX (EC 2.1.1.-) | 140 | 771.16
15 | Single-stranded DNA-binding protein | 151 | 769.41
16 | Major capsid protein | 100 | 745.51
17 | ATP synthase beta chain (EC 3.6.3.14) | 156 | 661.21
18 | UDP-glucose 4-epimerase (EC 5.1.3.2) | 169 | 657.36
19 | Ribonucleotide reductase of class Ia (aerobic), beta subunit (EC 1.17.4.1) | 150 | 652.32
20 | Integrase | 164 | 633.18

nMG: number of metagenomes in which the functional role is present at least once; nCAI: normalized cumulative abundance index. For each metagenome, a normalized abundance index (nAI) was calculated as the relative, length-normalized number of functional roles per million EGTs, and the nAI values for each functional role were added up to generate the normalized cumulative abundance index (nCAI).
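The index described in this legend can be sketched as follows. The exact formula and scaling used in the study are not spelled out here, so the arithmetic below is an assumption: hits are divided by the mean length of the matching proteins, then by the number of informative reads, and scaled to one million reads. The function names and the toy numbers are hypothetical.

```python
def normalized_abundance_index(hits, mean_protein_length, informative_reads):
    """nAI for one functional role in one metagenome (assumed formula).

    hits: number of EGTs assigned to the functional role
    mean_protein_length: mean length of the most similar proteins
    informative_reads: reads matching annotated proteins in the sample
    """
    abundance_index = hits / mean_protein_length    # length normalization
    return abundance_index / informative_reads * 1_000_000


def cumulative_index(per_metagenome_rows):
    """nCAI: sum of the nAI values of one role over all metagenomes."""
    return sum(normalized_abundance_index(*row) for row in per_metagenome_rows)


# Toy values, invented for illustration:
# (hits, mean protein length, informative reads) per metagenome.
toy_rows = [
    (300, 300.0, 100_000),
    (50, 250.0, 200_000),
]
toy_ncai = cumulative_index(toy_rows)
```

With these toy rows the two samples contribute about 10 and about 1, so the cumulative index is about 11. Dividing by protein length keeps long genes from looking more abundant merely because random sequencing samples them more often.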

The metagenomic data sets, which have been sequenced by different research groups, have been analyzed, consistently annotated and made publicly available through the metagenomics RAST server [http://metagenomics.theseed.org (15)]. They include both free-living and metazoan-associated viral, bacterial and eukaryotic sequences from autotrophic and heterotrophic communities from a wide variety of environments. In the analyzed metagenomes, the two most abundant functional genes are related to transposable elements [transposase and the retrotransposon-related p150 protein (77)]. Next to these, a set of photosynthesis-related genes; genes encoding viral structural, nonstructural, capsid and integrase proteins; genes associated with DNA replication; and genes involved in DNA synthesis are all among the most abundant biological functions in environmental metagenomes (Table 2 and Supplementary Table S3). Since gene abundance in metagenomes is sensitive to sampling bias and sequencing depth, we also combined ubiquity with abundance data. The combined analysis confirmed the prevalence of transposases (abundant in 95% of the analyzed metagenomes) over the retrotransposon-related p150 genes (overly abundant in only 36% of these metagenomes) and other replication and DNA metabolism-related genes that are equally ubiquitous but less abundant than transposases (Figure 2). The abundance of all analyzed non-hypothetical functions does not necessarily correlate with their ubiquity (Pearson correlation index = 0.524, Figure 2), i.e. many EGTs were pervasive in some ecosystems but absent in others (e.g. photosystem II proteins, p150 and viral structural genes; Table 2). Ubiquitous EGTs, on the other hand, include those matching transposases, DNA polymerases and enzymes involved in nucleotide metabolism (e.g. dTDP-glucose 4,6-dehydratase, UDP-glucose 4-epimerase and RNR; see Table 3 and Supplementary Table S4). Most of the ubiquitous EGTs are likely to be ‘housekeeping’ and essential for life, rather than habitat-specific (Figure 2). Additionally, many of these EGTs (e.g. DNA polymerases and RNRs) are found in all cellular and non-cellular biological entities, including viruses. As with genome sequence data, transposases are unequally distributed in ecosystems. This unequal distribution is in accordance with studies of ocean community genomics that showed a depth-dependent abundance of transposase genes (30) and a recent study that reported an unusually high abundance of transposase and retroviral integrase genes in a hydrothermal chimney biofilm (49).
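The correlation between abundance and ubiquity reported above is a Pearson product-moment coefficient. The sketch below shows how such a coefficient is computed from paired values; it is a plain textbook implementation, not the authors' code. The sample pairs are the first five rows of Table 2 (nMG, nCAI), so the result is not expected to reproduce the value of 0.524, which was computed over all analyzed functional roles.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Ubiquity (nMG) and abundance (nCAI) for ranks 1-5 of Table 2.
ubiquity = [178, 69, 126, 170, 32]
abundance = [4026.17, 3412.12, 1909.75, 1528.03, 1481.67]
r_top5 = pearson_r(ubiquity, abundance)
```

A coefficient well below 1 means that knowing how widespread a role is says only a limited amount about how many copies of it are found, which is why the study reports the two measures separately.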
Figure 2.

The normalized cumulative abundance indices (nCAI) of different functional roles in 187 metagenomes plotted against the ubiquity of these functional roles (defined as the number of metagenomes in which the functional role is represented at least once). r, Pearson’s product moment correlation between abundance and ubiquity; DNA Pol, DNA polymerase; dTDP-G 4,6 DH, dTDP-glucose 4,6 dehydratase; Rep, replication-associated protein; RNR, ribonucleotide reductase; SSB, single-stranded DNA-binding protein; ThyX, thymidylate synthase thyX (EC 2.1.1.-); UDP-G 4-epi, UDP-glucose 4-epimerase.

Table 3.

The 20 most ubiquitous functional roles in metagenomes

Rank | Functional role | nMG | %
1 | Transposase | 178 | 95.19
2 | DNA polymerase I (EC 2.7.7.7) | 171 | 91.44
3 | dTDP-glucose 4,6-dehydratase (EC 4.2.1.46) | 170 | 90.91
4 | DNA polymerase III alpha subunit (EC 2.7.7.7) | 170 | 90.91
5 | ABC transporter, ATP-binding protein | 170 | 90.91
6 | UDP-glucose 4-epimerase (EC 5.1.3.2) | 169 | 90.37
7 | Heat shock protein 60 family chaperone GroEL | 167 | 89.30
8 | Chaperone protein DnaK | 167 | 89.30
9 | Ribonucleotide reductase of class II (coenzyme B12-dependent) (EC 1.17.4.1) | 166 | 88.77
10 | Ribonucleotide reductase of class Ia (aerobic), alpha subunit (EC 1.17.4.1) | 165 | 88.24
11 | Replicative DNA helicase (EC 3.6.1.-) | 165 | 88.24
12 | Integrase | 164 | 87.70
13 | Long-chain-fatty-acid–CoA ligase (EC 6.2.1.3) | 164 | 87.70
14 | Phosphate starvation-inducible protein PhoH, predicted ATPase | 163 | 87.17
15 | Carbamoyl-phosphate synthase large chain (EC 6.3.5.5) | 163 | 87.17
16 | DNA primase (EC 2.7.7.-) | 163 | 87.17
17 | Glycosyltransferase | 163 | 87.17
18 | Valyl-tRNA synthetase (EC 6.1.1.9) | 163 | 87.17
19 | Thymidylate synthase (EC 2.1.1.45) | 163 | 87.17
20 | ATP-dependent Clp protease ATP-binding subunit clpX | 162 | 86.63

nMG: number of metagenomes in which the functional role is present at least once; %: percentage of nMG to the total number of metagenomes analyzed (187).
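The percentage column of Table 3 is the number of metagenomes containing a role divided by the 187 metagenomes analyzed. A minimal sketch of that calculation, with an illustrative function name, is:

```python
N_METAGENOMES = 187  # total metagenomes analyzed (Table 3 legend)


def ubiquity_percent(n_present, n_total=N_METAGENOMES):
    """Percentage of metagenomes in which a functional role occurs."""
    return 100 * n_present / n_total


# Ranks 1, 2 and 20 of Table 3: transposase, DNA polymerase I, clpX.
transposase_pct = round(ubiquity_percent(178), 2)
dna_pol_i_pct = round(ubiquity_percent(171), 2)
clpx_pct = round(ubiquity_percent(162), 2)
```

These reproduce the tabulated values of 95.19, 91.44 and 86.63. The spread across the top 20 roles is under nine percentage points, so the ubiquity ranking is much flatter than the abundance ranking in Table 2.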

Other than the predominance of transposases, ABC transporter ATP-binding proteins and phage integrases (Table 2), there is little agreement in the gene abundance data between genomes and metagenomes (Tables 1 and 2). In genomic data, the most abundant functional roles reflect the over-representation of bacterial proteins in currently available fully sequenced genomes (2.5 million bacterial proteins versus 560 000 eukaryotic, 100 000 archaeal and 40 000 viral proteins). This bias may decrease when more viral genomes are sequenced and better annotated to reflect their actual distribution in nature. In metagenomic data, abundance indices reflect an overrepresentation of bacterial, archaeal and viral over eukaryotic sequences in currently available data sets; however, this overrepresentation is in agreement with reports that bacteria and archaea dominate the cellular world (78) while viruses are the most abundant biological entities (22,23).

DISCUSSION

The main assumption of this study is that the most successful genes are likely to be prevalent in genomes and ecosystems. We defined the most prevalent gene as the one ‘spreading its DNA around’ and not the one expressing the most protein molecules. Thus, while RuBisCO, for example, is claimed as the most abundant enzyme on Earth (11) based on the estimated number of its protein molecules, its genes are neither the most abundant nor most widely distributed (Supplementary Table S5). In addition, we focused on PEGs and did not include genes encoding ribosomal RNA in the analysis; those are absent in viruses and usually present in multiple copies in cellular genomes [1–15, mean = 4, (79)], which would place them at the 12th rank in gene abundance in all sequenced genomes (compare with Table 1). This study demonstrates that transposases are the most abundant genes in both completely sequenced genomes and environmental metagenomes, and are also the most ubiquitous in metagenomes. Transposase genes encode DNA-binding enzymes, members of the polynucleotidyl transferase superfamily, that catalyze ‘cut-and-paste’ or ‘copy-and-paste’ reactions promoting the movement of DNA segments to new sites (80). The term transposase is often used to describe what are classically known as DNA transposases or type II transposases. These move double-stranded DNA directly by excision and insertion, and are sometimes associated with insertion sequences, but often just catalyze their own mobilization (81,82). The major group of dsDNA transposases is known as DDE transposases due to their possession of a non-contiguous, highly conserved catalytic triad of two aspartate (D) and one glutamate (E) residues (83). Other protein families that essentially use transposition but lack the DDE motif include tyrosine and serine recombinases, and rolling-circle transposases (82). 
In addition, within these transposase subclasses, several protein family domains [PFam domains (84)] have been described (49,83), yet a large fraction of transposases identified in genomes and metagenomes remain unclassified. There are two other classes of transposable elements (Types I and III) that are distinguished as separate categories and were not as abundant or ubiquitous as Type II transposases in our analyzed data sets. Type I includes retrotransposons, which use the enzyme retrotransposase to move DNA by reverse transcription of an RNA intermediate (85). Retrotransposases (Type I transposases) are suggested to be responsible for the majority of ‘junk’ repeats, which make up >40% of the human genome and seem to code for no other genes (86–88). Type III transposable elements are associated with miniature inverted-repeat transposable elements (MITEs) (89,90). Transposases, in general, and Type II transposases, in particular, constitute a highly diverse group of enzymes. It is difficult to provide a robust, consistent scheme for classifying transposase sequences in ecosystems; however, structure-based classification schemes are being developed (83). The prevalence of transposons (Type II) and retrotransposons (Type I) in eukaryotic genomes has been well documented, but in these genomes they are mostly associated with non-coding, repetitive DNA (91–93). Moreover, Type II transposases are continuously being detected in bacterial, archaeal and, to a lesser extent, bacteriophage genomes. In this work, we demonstrate that these jumping genes are also almost omnipresent in every ecosystem that contains nucleic acid-based life forms.

OUTLOOK

Transposase genes have been classically considered as ‘selfish genes’ with no other purpose than spreading themselves and are thus expected to be universal DNA parasites (6,85). If this were their only raison-d’être, they have certainly fulfilled it by surviving, persisting and prevailing in all ecosystems. An open question is whether their ubiquity is also an indication of eco-essentiality. The finding that transposases are as ubiquitous as housekeeping DNA-processing enzymes but that they outnumber all essential genes (Figure 3) supports the idea that these mobile, self-replicating genes strive to inhabit and multiply in as many genomes as possible.
Figure 3.

Word clouds (created on http://www.wordle.net) representing (A) the 100 most abundant functional roles (Supplementary Table S3) and (B) the 100 most ubiquitous functional roles (Supplementary Table S4) in metagenomes. The font size of each functional role is proportional to its (A) abundance index or (B) number of metagenomes in which it is present.

Besides the obvious detrimental effect that transposition can cause to host genomes—by inactivating housekeeping genes or impairing the chromosome’s structure—transposases also play beneficial roles (92). For example, transposases may mobilize or activate genes that enhance their hosts’ fitness (94,95), induce advantageous rearrangements (96) or enrich the host’s gene pool (97–100). There are accruing documented examples of transposase genes co-opted by the host to encode transcription factors (99), centromere-binding proteins (100) or generators of diversity in the immune system (97,98), a process described as exaptation [or domestication, from a host-centric view (94)]. Such cases can involve one or a few transposases per genome or, as more recently shown, thousands of transposases (95). Despite their ubiquity and abundance, there is neither evidence nor reason to believe that transposases encode conserved essential cellular functions. In our opinion, the role of transposases as diversifying agents (94,101) is beneficial enough to be selected for; however, the cost of transposon-induced mutations also puts pressure on the cells to inactivate or delete their transposases (16,91,93,101). In conclusion, the prevalence of transposases in metagenomes and completely sequenced genomes from bacteria, archaea, eukaryotes and viruses is in accordance with suggestions that they may offer a selective advantage to the genomes and ecosystems that they ‘parasitize’ (17,94,101). The diversification they induce in these genomes and ecosystems is arguably an essential way of maintaining, diversifying and evolving life on our planet.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Science Foundation, Division of Biological Infrastructure (DBI-0850356 to R.A.E. and DBI-0850206 to M.B.); the NMPDR project was supported by the National Institutes of Health (HHSN266200400042C). Funding for open access charge: National Science Foundation, Division of Biological Infrastructure (DBI-0850356 to R.A.E.).

Conflict of interest statement. None declared.