
Transposases are the most abundant, most ubiquitous genes in nature.

Ramy K Aziz, Mya Breitbart, Robert A Edwards.

Abstract

Genes, like organisms, struggle for existence, and the most successful genes persist and widely disseminate in nature. The unbiased determination of the most successful genes requires access to sequence data from a wide range of phylogenetic taxa and ecosystems, which has finally become achievable thanks to the deluge of genomic and metagenomic sequences. Here, we analyzed 10 million protein-encoding genes and gene tags in sequenced bacterial, archaeal, eukaryotic and viral genomes and metagenomes, and our analysis demonstrates that genes encoding transposases are the most prevalent genes in nature. The finding that these genes, classically considered selfish genes, outnumber essential or housekeeping genes suggests that they offer a selective advantage to the genomes and ecosystems they inhabit, a hypothesis in agreement with an emerging body of literature. Their mobile nature not only promotes dissemination of transposable elements within and between genomes but also leads to mutations and rearrangements that can accelerate biological diversification and, consequently, evolution. By securing their own replication and dissemination, transposases are guaranteed to thrive so long as nucleic acid-based life forms exist.

Year: 2010 PMID: 20215432 PMCID: PMC2910039 DOI: 10.1093/nar/gkq140

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Since life first emerged, organisms have been struggling for survival and competing over the finite resources within their ecosystems (1,2). This struggle is not confined to the organism level; it also applies to individual genes (3) and even non-coding DNA segments (4,5). As a corollary, a gene’s success can be determined by its ability to persist in nature and to spread throughout genomes and biomes (6). For this to take place, genes need some sequence plasticity to adapt to different environments while retaining enough sequence conservation to preserve the structure of their encoded proteins and the identity of their encoded biological functions (7,8). Every time a new genome is sequenced, many genes are identified and annotated based on their homology to sequences available in databases, but new genes with novel functions are also identified, adding to the universal gene pool. To date, no study has systematically and directly surveyed the millions of protein-encoding genes (PEGs) deposited in sequence databases to identify their relative prevalence. There have been several challenges to such an endeavor: (i) the absence of numerical parameters to assess a gene’s prevalence; (ii) the lack of fair representation of the tree of life within available sequence data (9,10) and (iii) the difficulty of defining what is meant by ‘same gene’ in different organisms and ecosystems. To overcome these difficulties, (i) we calculated both the abundance and ubiquity of all known biological functions encoded in genomes and ecosystems to estimate their prevalence, with the assumption that these values will be correlated with gene fitness; (ii) we surveyed both genomic and metagenomic data sets to reduce bias caused by the uneven sampling of the tree of life in genomic data sets; and (iii) we defined similar genes as those encoding proteins with similar specific biological functions. 
In some instances, this definition could be regarded as an oversimplification, notably in cases of convergent evolution or homoplasy, where multiple genes of different origin evolve to perform similar biological functions. However, the majority of current gene annotations are specific enough to distinguish many instances of paralogous genes or different classes within gene/protein families. It is also understandable that different genes are under different selection pressures, as some are forced to endure mutations and tolerate sequence variability to escape pressure (e.g. bacterial genes encoding immunogenic proteins that are under pressure of the host immune system and genes encoding surface proteins that are easily recognized by predators) while others are under strict sequence conservation pressure (e.g. genes encoding housekeeping enzymes and essential biological functions). Importantly, in determining gene prevalence we distinguished between ubiquity and abundance. Ubiquity is one of the indicators of essentiality, while abundance without ubiquity is an indicator of adaptive, organism-specific or habitat-specific functionality. In other words, ubiquitous genes are assumed to be those that carry essential functions and are thus indispensable in every genome (elements of core genomes) or every ecosystem (eco-essential genes). On the other hand, genes that are overly abundant in few ecosystems and absent in others are likely to play essential habitat-specific roles (e.g. photosynthesis, anaerobic metabolism, detoxification, etc.). Contenders for the ‘fittest gene’ title include the gene encoding ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO), an enzyme that plays a critical role in the fixation of carbon dioxide via the Calvin cycle and that has been touted as the single most successful, most abundant enzyme on the planet (11). Genes encoding ribosomal proteins are also plausible candidates. 
However, those are largely limited to cellular life forms, are non-essential to and almost absent from viruses (12), and are divergent between eukaryotes and prokaryotes. Additionally, DNA polymerase genes and other genes involved in DNA synthesis and nucleotide/nucleoside metabolism (e.g. ribonucleotide reductases, RNR) are essential for DNA-based life and are not restricted to cellular organisms, being found in viral genomes as well. Their essentiality makes them strong competitors; yet, they are often present in only one or a few copies per genome. To our surprise, none of these candidate genes topped the list of the most abundant, most ubiquitous genes. Instead, our analysis singled out genes encoding transposases as the most abundant genes in genomes and metagenomes, and the most ubiquitous in metagenomes.

ANALYSIS OF GENOMES AND METAGENOMES

To determine the most abundant non-hypothetical PEG, we examined almost 10 million annotated genes or gene tags: over 3.2 million PEGs in fully sequenced viral, bacterial, archaeal and eukaryotic genomes (2137 genomes on 1 May 2009) and over 6.7 million environmental gene tags (EGT)—with significant matches to known proteins—in 187 random community genomes (metagenomes). For functional assignments, we mostly relied on the annotations available in the SEED database (13) because it uses subsystems-based controlled vocabulary curated by human experts and automatically propagated among genomes (14). For consistency, the same SEED subsystems were used for the annotation of all metagenomic data sets described in this study (15).

Analysis of complete genome sequences

We screened 2137 complete genomes (47 archaeal, 725 bacterial, 29 eukaryotic and 1336 viral genomes at the time this study was performed) available in the SEED database (URL: http://seed-viewer.theseed.org) and identified 37 258 PEGs (1.163% of all PEGs) annotated as transposase-related. Of these, 26 625 (0.825% of all PEGs) were explicitly annotated as ‘transposases’, 360 were annotated as ‘degenerate transposases’, and the remainder carried a variety of insertion sequence-related transposase annotations, which may or may not denote functional genes. Even when these ambiguous annotations were excluded from the final counts, transposases remained the most abundant PEGs in the completely sequenced genomes (Figure 1).

Figure 1.

Abundance of different functional roles in 2137 genomes plotted against the ubiquity of these functional roles (defined as the number of genomes in which the functional role is represented at least once). r, Pearson’s product moment correlation between abundance and ubiquity; Cys, cysteine; Thio, thioredoxin; ThioR, thioredoxin reductase. Proteins annotated solely based on their location or posttranslational modification but not their biological functions (e.g. membrane proteins, cytoplasmic proteins, secreted proteins, transmembrane proteins and generic lipoproteins) were excluded; an exception was the ‘outer membrane protein’ annotation as it describes specific bacterial proteins rather than protein localization.

These data imply that out of a set of 2000 randomly sampled genes (the average number of genes in a typical bacterial genome), 22 genes are expected to encode transposases, at least 16 of which are likely functional. Obviously, genomes that have transposase genes tend to have them in multiple copies; this explains why, although two-thirds of sequenced genomes (mostly viral) lack known functional transposases, the average number of transposases, when present, is 38.42 per genome (Table 1 and Supplementary Table S1). This observation is also in agreement with reports that transposases are unequally distributed among bacterial genomes, with higher abundance in facultative pathogens and free-living bacteria than in obligate pathogens and endosymbionts (16), and with extraordinarily high numbers in some species, e.g. Crocosphaera watsonii (17,18).
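The two quantities plotted in Figure 1 can be illustrated with a small sketch. It assumes that per-genome gene counts are available as dictionaries keyed by functional role; the toy data, the function names and the data layout are hypothetical and are not taken from the SEED database or from the authors' pipeline.

```python
from math import sqrt

def abundance_and_ubiquity(counts_per_genome):
    """Return (abundance, ubiquity) dictionaries keyed by functional role.

    abundance: total number of genes with that role across all genomes.
    ubiquity:  number of genomes in which the role occurs at least once.
    """
    abundance, ubiquity = {}, {}
    for genome in counts_per_genome:
        for role, n in genome.items():
            if n > 0:
                abundance[role] = abundance.get(role, 0) + n
                ubiquity[role] = ubiquity.get(role, 0) + 1
    return abundance, ubiquity

def pearson_r(xs, ys):
    """Pearson's product moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy input: three genomes with made-up counts.
toy_genomes = [
    {"Transposase": 40, "Valyl-tRNA synthetase": 1},
    {"Valyl-tRNA synthetase": 1},
    {"Transposase": 12, "Valyl-tRNA synthetase": 1, "Integrase": 2},
]

abundance, ubiquity = abundance_and_ubiquity(toy_genomes)
roles = sorted(abundance)
r = pearson_r([abundance[k] for k in roles], [ubiquity[k] for k in roles])

# Mean copy number in the genomes that carry the role (the C/n idea of Table 1).
copies_per_positive_genome = {k: abundance[k] / ubiquity[k] for k in roles}
```

In this toy case the transposase role is abundant but not the most ubiquitous, while the tRNA synthetase role is present in every genome at a single copy, which mirrors the contrast the text draws between high-copy and low-copy genes.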

Table 1.

The 20 most abundant non-hypothetical protein-encoding genes in all sequenced genomes

Rank	Functional role	nG (first line) / Count (second line)	V	A	B	E	C/n	%
1	Transposase	693	15 (1.1%)	31 (66%)	630 (86.9%)	17 (58.6%)	38.42	0.83
		26 625	21	736	25 226	642
2	ABC transporter, ATP-binding protein	738	1 (<1%)	39 (83%)	682 (94.1%)	16 (55.2%)	12.71	0.29
		9382	1	264	8998	119
3	Sensor histidine kinase	574	–	22 (46.8%)	550 (75.9%)	2 (6.9%)	9.71	0.17
		5575		294	5276	5
4	DNA-binding response regulator	578	–	13 (27.7%)	562 (77.5%)	3 (10.3%)	8.20	0.15
		4708		33	4669	6
5	Methyl-accepting chemotaxis protein	408	1 (<1%)	15 (31.9%)	391 (53.9%)	1 (3.4%)	10.76	0.14
		4389	1	64	4318	6
6	ABC transporter, permease protein	580	–	33 (70.2%)	545 (75.2%)	2 (6.9%)	7.55	0.14
		4377		137	4238	2
7	Glycosyltransferase (EC 2.4.1.-)	649	–	41 (87.2%)	598 (82.5%)	10 (34.5%)	6.43	0.13
		4172		287	3863	22
8	Transcriptional regulator, LysR family	441	–	8 (17%)	430 (59.3%)	3 (10.3%)	9.15	0.13
		4037		10	4017	10
9	Transcriptional regulator, TetR family	535	–	14 (29.8%)	521 (71.9%)	–	6.93	0.12
		3709		45	3664
10	Acetyltransferase, GNAT family	480	–	19 (40.4%)	458 (63.2%)	3 (10.3%)	7.33	0.11
		3516		53	3453	10
11	Transcriptional regulator, AraC family	459	1 (<1%)	7 (14.9%)	450 (62.1%)	1 (3.4%)	7.37	0.11
		3382	1	7	3373	1
12	Long-chain-fatty-acid–CoA ligase (EC 6.2.1.3)	589	–	31 (66%)	533 (73.5%)	25 (86.2%)	5.08	0.09
		2995		68	2728	199
13	Transcriptional regulator, MarR family	546	–	23 (48.9%)	522 (72%)	1 (3.4%)	5.32	0.09
		2905		71	2831	3
14	Permeases of the major facilitator superfamily	393	1 (<1%)	12 (25.5%)	375 (51.7%)	5 (17.2%)	6.95	0.09
		2733	1	22	2701	9
15	Acetyltransferase (EC 2.3.1.-)	559	4^a (<1%)	22 (46.8%)	532 (73.4%)	5 (17.2%)	4.36	0.08
		2436	4	57	2374	5
16	Cysteine desulfurase (EC 2.8.1.7)	783	–	36 (76.6%)	722 (99.6%)	25 (86.2%)	3.02	0.07
		2362		66	2239	57
17	3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)	706	–	27 (57.4%)	665 (91.7%)	14 (48.3%)	2.80	0.06
		1975		68	1863	44
18	Integrase	534	70 (5.2%)	11 (23.4%)	448 (61.8%)	5 (17.2%)	3.43	0.06
		1829	70	19	1729	11
19	Outer membrane protein	415	1 (<1%)	10 (21.3%)	402 (55.4%)	2 (6.9%)	4.34	0.06
		1803	1	12	1786	4
20	Permease of the drug/metabolite transporter (DMT) superfamily	518	–	28 (59.6%)	486 (67%)	4 (13.8%)	3.37	0.05
		1746		53	1688	5

nG: number of genomes in which the functional role is present at least once; Count: number of genes in all sequenced genomes; V, A, B, E: viruses, archaea, bacteria, eukarya, respectively; C/n: average number of genes per positive genome; %: percentage of genes to the total number of genes in all genomes (n = 3 204 918).

^a Acetyltransferase-like proteins that were missed in the automated analysis.
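As a worked check of the two derived columns in Table 1, the sketch below recomputes C/n and % from the nG and Count values given in the table and the total n stated in the legend. The function and variable names are illustrative only.

```python
TOTAL_GENES = 3_204_918  # n, from the Table 1 legend

def derived_columns(n_genomes, count, total=TOTAL_GENES):
    """Return (C/n, %) for one functional role.

    C/n: average number of genes per genome that has the role at least once.
    %:   share of the role among all genes in all genomes.
    """
    copies_per_positive_genome = count / n_genomes
    percent_of_all_genes = 100 * count / total
    return round(copies_per_positive_genome, 2), round(percent_of_all_genes, 2)

# Rows 1 and 2 of Table 1 (nG, Count).
transposase = derived_columns(693, 26_625)
abc_atp_binding = derived_columns(738, 9_382)

print(transposase)      # (38.42, 0.83)
print(abc_atp_binding)  # (12.71, 0.29)
```

Both results agree with the C/n and % values printed in the table.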

While the abundance of transposase genes in microbial genomes has been recognized for a long time, only recently has it been exploited for inferring microbial cohabitation patterns and lateral gene transfer (19). Next to transposases, the most abundant functional roles in all sequenced genomes include ABC transporters, transcriptional regulators of different families, signal transduction kinases, chemotaxis proteins, acetyl- and glycosyltransferases and cysteine desulfurase (Table 1 and Supplementary Table S1). On the other hand, the most ubiquitous functional roles in sequenced genomes are encoded by low-copy-number genes that consequently have a low overall abundance. Only four out of the 100 most ubiquitous functional genes have a mean copy number >2 per genome. These are genes encoding thioredoxin reductase; thioredoxin; cysteine desulfurase and the ABC transporter, ATP-binding protein (Supplementary Table S2). The list of most ubiquitous functional roles in genomes was topped by tRNA synthetases (Figure 1), and other genes associated with protein synthesis and post-translational protein sorting (e.g. translation elongation factor and preprotein translocase, Supplementary Table S2).

Analysis of metagenomic sequences

In spite of the striking prevalence and high copy numbers of transposase genes in fully sequenced genomes, the use of those data sets is prone to biases. The available genomes unevenly represent the tree of life as they mostly correspond to cultured organisms from just four bacterial phyla (9). Moreover, there is an over-representation of microbes of interest to humans (20), such as bacterial pathogens and microbes used in agriculture or industry (21). Finally, while viruses are at least 10 times as abundant as bacteria in nature (22,23), sequenced viral genomes are lagging behind both in terms of numbers (∼2:1 viral to bacterial genome ratio) and annotation quality (most encoded proteins are of unknown functions). In contrast, analysis of community genomes (metagenomes) offers a less-biased representation of life forms and biological functions in various habitats. The term ‘metagenome’ describes the collective genomes found in a particular ecosystem (24,25). Since the first uncultured viral community genomic sequences were published in 2002 (26), metagenomics has emerged as a rapid and efficient method of identifying not only the species present in a given ecosystem but also the ecosystem-associated metabolic signatures or patterns (27–31). The emergence of low-cost, high-throughput next-generation sequencing technologies (32–37) has enabled the quick implementation of metagenomics in the analysis of different environments, allowing an unprecedented view of biodiversity (25,38–42). Over the past few years, metagenomic sequencing has been used to explore a wide range of environments, encompassing various marine ecosystems (28,43–47), hydrothermal vents (48,49), corals (50–52), salterns (53,54), soil (55–57), sludge (58), mines (59), human and animal guts (60–64) and lungs (65), microbialites (66,67) and even mosquitoes (31). 
Metagenomic analysis is shifting the paradigm from organism/genome-centric to gene-centric and pathway-centric approaches to understanding biodiversity (68,69). Several bioinformatics and statistical tools allow the metabolic reconstruction of a particular ecosystem by enumerating EGTs in metagenomes and binning them either phylogenetically or biochemically (15,68,70–73), as well as the comparison of multiple metagenomes (59,74,75). In this study, we followed a gene-centric approach by enumerating EGTs, and estimating the abundance and ubiquity of their different functional roles in 187 different metagenomic samples representing a broad range of environments. Assessing EGT abundance in metagenomic data is slightly different from determining PEG frequency in fully sequenced genomes. In genomes, a single, full-length copy of a gene reflects a single occurrence of that gene in one cell of an organism. In metagenomic data, multiple occurrences of an EGT can be attributed to multiple copies of the same gene, multiple orthologs (from different genomes), multiple paralogs or just multiple sequences covering different parts of the exact same DNA segment. Moreover, the coding sequence length is a potential confounding factor: longer genes are more likely to be sampled by random sequencing (unless the sample is large enough to provide 100% coverage). For those reasons, the frequency of each EGT was normalized to the mean length of the most similar proteins [from BLASTX (76) results] to generate an abundance index, which was further divided by the number of informative sequence reads (those sequence tags matching annotated proteins in known databases) to generate a normalized abundance index (see the legend of Table 2 for more details).

Table 2.

The 20 most abundant functional roles in metagenomes

Rank	Functional role	nMG	nCAI
1	Transposase	178	4026.17
2	Retrotransposon-related p150 protein	69	3412.12
3	Viral structural protein	126	1909.75
4	ABC transporter, ATP-binding protein	170	1528.03
5	Replication-associated protein	32	1481.67
6	Photosystem II CP43 protein (PsbC)	47	1429.44
7	Photosystem II protein D2 (PsbD)	71	1224.89
8	Replication protein Rep	66	1213.18
9	Photosystem II protein D1 (PsbA)	83	930.2
10	Cytochrome b6-f complex subunit, cytochrome b6	51	925.32
11	Viral nonstructural protein	39	847.57
12	ATP synthase alpha chain (EC 3.6.3.14)	157	804.47
13	Ribonucleotide reductase of class Ia (aerobic), alpha subunit (EC 1.17.4.1)	165	776.57
14	Thymidylate synthase thyX (EC 2.1.1.-)	140	771.16
15	Single-stranded DNA-binding protein	151	769.41
16	Major capsid protein	100	745.51
17	ATP synthase beta chain (EC 3.6.3.14)	156	661.21
18	UDP-glucose 4-epimerase (EC 5.1.3.2)	169	657.36
19	Ribonucleotide reductase of class Ia (aerobic), beta subunit (EC 1.17.4.1)	150	652.32
20	Integrase	164	633.18

nMG: number of metagenomes in which the functional role is present at least once; nCAI: normalized cumulative abundance index. For each metagenome, a normalized abundance index (nAI) was calculated as the relative, length-normalized number of functional roles per million EGTs, and the nAI values for each functional role were added up to generate the normalized cumulative abundance index (nCAI).
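The normalization described in this legend can be sketched as follows. The exact formula and scaling constants used by the authors are not given in the text, so this is only one plausible reading: hits are divided by the mean length of the matching proteins, then by the number of informative reads, and scaled to one million. All names and the toy numbers are hypothetical.

```python
def normalized_abundance_index(hits, mean_protein_length, informative_reads,
                               scale=1_000_000):
    """nAI for one functional role in one metagenome (assumed formula).

    hits:                number of EGTs assigned to the role
    mean_protein_length: mean length of the most similar proteins
    informative_reads:   reads matching annotated proteins in this metagenome
    scale:               'per million' factor (an assumption of this sketch)
    """
    abundance_index = hits / mean_protein_length      # length normalization
    return abundance_index / informative_reads * scale

def cumulative_abundance_index(metagenomes):
    """nCAI per role: the sum of nAI values over all metagenomes."""
    totals = {}
    for sample in metagenomes:
        for role, (hits, mean_len) in sample["roles"].items():
            nai = normalized_abundance_index(
                hits, mean_len, sample["informative_reads"])
            totals[role] = totals.get(role, 0.0) + nai
    return totals

# Hypothetical toy input: two metagenomes with made-up hit counts.
toy_metagenomes = [
    {"informative_reads": 10_000,
     "roles": {"Transposase": (50, 400)}},
    {"informative_reads": 20_000,
     "roles": {"Transposase": (20, 400), "PsbA": (60, 300)}},
]

ncai = cumulative_abundance_index(toy_metagenomes)
```

Because each metagenome is normalized by its own number of informative reads before summing, a deeply sequenced sample does not dominate the cumulative index simply by contributing more reads.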

The metagenomic data sets, which were sequenced by different research groups, have been analyzed, consistently annotated and made publicly available through the metagenomics RAST server [http://metagenomics.theseed.org (15)]. They include both free-living and metazoan-associated viral, bacterial and eukaryotic sequences from autotrophic and heterotrophic communities from a wide variety of environments. In the analyzed metagenomes, the two most abundant functional genes are related to transposable elements [transposase and the retrotransposon-related p150 protein (77)]. Next to these, a set of photosynthesis-related genes; genes encoding viral structural, nonstructural, capsid and integrase proteins; genes associated with DNA replication; and genes involved in DNA synthesis are all among the most abundant biological functions in environmental metagenomes (Table 2 and Supplementary Table S3). Since gene abundance in metagenomes is sensitive to sampling bias and sequencing depth, we also combined ubiquity with abundance data. The combined analysis confirmed the prevalence of transposases (abundant in 95% of the analyzed metagenomes) over the retrotransposon-related p150 genes (overly abundant in only 36% of these metagenomes) and other replication and DNA metabolism-related genes that are equally ubiquitous but less abundant than transposases (Figure 2). The abundance of all analyzed non-hypothetical functions does not necessarily correlate with their ubiquity (Pearson correlation coefficient = 0.524, Figure 2), i.e. many EGTs were pervasive in some ecosystems but absent in others (e.g. photosystem II proteins, p150 and viral structural genes; Table 2). Ubiquitous EGTs, on the other hand, include those matching transposases, DNA polymerases and enzymes involved in nucleotide metabolism (e.g. dTDP-glucose 4,6-dehydratase, UDP-glucose 4-epimerase and RNR; see Table 3 and Supplementary Table S4). Most of the ubiquitous EGTs are likely to be ‘housekeeping’ and essential for life, rather than habitat-specific (Figure 2). Additionally, many of these EGTs (e.g. DNA polymerases and RNRs) are found in all cellular and non-cellular biological entities, including viruses. As with genome sequence data, transposases are unequally distributed in ecosystems. This unequal distribution is in accordance with studies of ocean community genomics that showed a depth-dependent abundance of transposase genes (30) and a recent study that reported an unusually high abundance of transposase and retroviral integrase genes in a hydrothermal chimney biofilm (49).

Figure 2.

The normalized cumulative abundance indices (nCAI) of different functional roles in 187 metagenomes plotted against the ubiquity of these functional roles (defined as the number of metagenomes in which the functional role is represented at least once). r, Pearson’s product moment correlation between abundance and ubiquity; DNA Pol, DNA polymerase; dTDP-G 4,6 DH, dTDP-glucose 4,6 dehydratase; Rep, replication-associated protein; RNR, ribonucleotide reductase; SSB, single-stranded DNA-binding protein; ThyX, thymidylate synthase thyX (EC 2.1.1.-); UDP-G 4-epi, UDP-glucose 4-epimerase.

Table 3.

The 20 most ubiquitous functional roles in metagenomes

Rank	Functional role	nMG	%
1	Transposase	178	95.19
2	DNA polymerase I (EC 2.7.7.7)	171	91.44
3	dTDP-glucose 4,6-dehydratase (EC 4.2.1.46)	170	90.91
4	DNA polymerase III alpha subunit (EC 2.7.7.7)	170	90.91
5	ABC transporter, ATP-binding protein	170	90.91
6	UDP-glucose 4-epimerase (EC 5.1.3.2)	169	90.37
7	Heat shock protein 60 family chaperone GroEL	167	89.30
8	Chaperone protein DnaK	167	89.30
9	Ribonucleotide reductase of class II (coenzyme B12-dependent) (EC 1.17.4.1)	166	88.77
10	Ribonucleotide reductase of class Ia (aerobic), alpha subunit (EC 1.17.4.1)	165	88.24
11	Replicative DNA helicase (EC 3.6.1.-)	165	88.24
12	Integrase	164	87.70
13	Long-chain-fatty-acid–CoA ligase (EC 6.2.1.3)	164	87.70
14	Phosphate starvation-inducible protein PhoH, predicted ATPase	163	87.17
15	Carbamoyl-phosphate synthase large chain (EC 6.3.5.5)	163	87.17
16	DNA primase (EC 2.7.7.-)	163	87.17
17	Glycosyltransferase	163	87.17
18	Valyl-tRNA synthetase (EC 6.1.1.9)	163	87.17
19	Thymidylate synthase (EC 2.1.1.45)	163	87.17
20	ATP-dependent Clp protease ATP-binding subunit clpX	162	86.63

nMG: number of metagenomes in which the functional role is present at least once; %: percentage of nMG to the total number of metagenomes analyzed (187).

Other than the predominance of transposases, ABC transporter ATP-binding proteins and phage integrases (Table 2), there is little agreement in the gene abundance data between genomes and metagenomes (Tables 1 and 2). In genomic data, the most abundant functional roles reflect the over-representation of bacterial proteins in currently available fully sequenced genomes (2.5 million bacterial proteins versus 560 000 eukaryotic, 100 000 archaeal and 40 000 viral proteins). This bias may decrease when more viral genomes are sequenced and better annotated to reflect their actual distribution in nature. In metagenomic data, abundance indices reflect an overrepresentation of bacterial, archaeal and viral over eukaryotic sequences in currently available data sets; however, this overrepresentation is in agreement with reports that bacteria and archaea dominate the cellular world (78) while viruses are the most abundant biological entities (22,23).

DISCUSSION

The main assumption of this study is that the most successful genes are likely to be prevalent in genomes and ecosystems. We defined the most prevalent gene as the one ‘spreading its DNA around’ and not the one expressing the most protein molecules. Thus, while RuBisCO, for example, is claimed as the most abundant enzyme on Earth (11) based on the estimated number of its protein molecules, its genes are neither the most abundant nor most widely distributed (Supplementary Table S5). In addition, we focused on PEGs and did not include genes encoding ribosomal RNA in the analysis; those are absent in viruses and usually present in multiple copies in cellular genomes [1–15, mean = 4, (79)], which would place them at the 12th rank in gene abundance in all sequenced genomes (compare with Table 1). This study demonstrates that transposases are the most abundant genes in both completely sequenced genomes and environmental metagenomes, and are also the most ubiquitous in metagenomes. Transposase genes encode DNA-binding enzymes, members of the polynucleotidyl transferase superfamily, that catalyze ‘cut-and-paste’ or ‘copy-and-paste’ reactions promoting the movement of DNA segments to new sites (80). The term transposase is often used to describe what are classically known as DNA transposases or type II transposases. These move double-stranded DNA directly by excision and insertion, and are sometimes associated with insertion sequences, but often just catalyze their own mobilization (81,82). The major group of dsDNA transposases is known as DDE transposases due to their possession of a non-contiguous, highly conserved catalytic triad of two aspartate (D) and one glutamate (E) residues (83). Other protein families that essentially use transposition but lack the DDE motif include tyrosine and serine recombinases, and rolling-circle transposases (82). 
In addition, within these transposase subclasses, several protein family domains [PFam domains (84)] have been described (49,83), yet a large fraction of transposases identified in genomes and metagenomes remain unclassified. There are two other classes of transposable elements (Types I and III) that are distinguished as separate categories and were not as abundant or ubiquitous as Type II transposases in our analyzed data sets. Type I includes retrotransposons, which use the enzyme retrotransposase to move DNA by reverse transcription of an RNA intermediate (85). Retrotransposases (Type I transposases) are suggested to be responsible for the majority of ‘junk’ repeats, which make up >40% of the human genome and seem to code for no other genes (86–88). Type III transposable elements are associated with miniature inverted-repeat transposable elements (MITEs) (89,90). Transposases, in general, and Type II transposases, in particular, constitute a highly diverse group of enzymes. It is difficult to provide a robust, consistent scheme for classifying transposase sequences in ecosystems; however, structure-based classification schemes are being developed (83). The prevalence of transposons (Type II) and retrotransposons (Type I) in eukaryotic genomes has been well documented, but in these genomes they are mostly associated with non-coding, repetitive DNA (91–93). Moreover, Type II transposases are continuously being detected in bacterial, archaeal and, to a lesser extent, bacteriophage genomes. In this work, we demonstrate that these jumping genes are also almost omnipresent in every ecosystem that contains nucleic acid-based life forms.

OUTLOOK

Transposase genes have been classically considered as ‘selfish genes’ with no other purpose than spreading themselves and are thus expected to be universal DNA parasites (6,85). If this were their only raison-d’être, they have certainly fulfilled it by surviving, persisting and prevailing in all ecosystems. An open question is whether their ubiquity is also an indication of eco-essentiality. The finding that transposases are as ubiquitous as housekeeping DNA-processing enzymes but that they outnumber all essential genes (Figure 3) supports the idea that these mobile, self-replicating genes strive to inhabit and multiply in as many genomes as possible.

Figure 3.

Word clouds (created on http://www.wordle.net) representing (A) the 100 most abundant functional roles (Supplementary Table S3) and (B) the 100 most ubiquitous functional roles (Supplementary Table S4) in metagenomes. The font size of each functional role is proportional to its (A) abundance index or (B) number of metagenomes in which it is present.

Besides the obvious detrimental effect that transposition can have on host genomes (by inactivating housekeeping genes or impairing the chromosome’s structure), transposases also play beneficial roles (92). For example, transposases may mobilize or activate genes that enhance their hosts’ fitness (94,95), induce advantageous rearrangements (96) or enrich the host’s gene pool (97–100). There is a growing number of documented examples of transposase genes co-opted by the host to encode transcription factors (99), centromere-binding proteins (100) or generators of diversity in the immune system (97,98), a process described as exaptation [or domestication, from a host-centric view (94)]. Such cases can involve one or a few transposases per genome or, as more recently shown, thousands of transposases (95). Despite their ubiquity and abundance, there is neither evidence nor reason to believe that transposases encode conserved essential cellular functions. In our opinion, the role of transposases as diversifying agents (94,101) is beneficial enough to be selected for; however, the cost of transposon-induced mutations also puts pressure on the cells to inactivate or delete their transposases (16,91,93,101). In conclusion, the prevalence of transposases in metagenomes and completely sequenced genomes from bacteria, archaea, eukaryotes and viruses is in accordance with suggestions that they may offer a selective advantage to the genomes and ecosystems that they ‘parasitize’ (17,94,101). The diversification they induce in these genomes and ecosystems is arguably an essential way of maintaining, diversifying and evolving life on our planet.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Science Foundation, Division of Biological Infrastructure (DBI-0850356 to R.A.E. and DBI-0850206 to M.B.); the NMPDR project was supported by National Institutes of Health (HHSN266200400042C). Funding for open access charge: National Science Foundation, Division of Biological Infrastructure (DBI-0850356 to R.A.E.).

Conflict of interest statement. None declared.
