Literature DB >> 30546254

Prediction of Horizontally and Widely Transferred Genes in Prokaryotes.

Abstract

Horizontal gene transfer (HGT) is the process whereby an organism acquires exogenous genes (horizontally transferred genes or HT genes) that are not inherited from the parent, but are derived from another organism. In prokaryotes, HGT has been considered as one of the important driving forces of evolution. Previously, genome-wide analyses have been conducted for estimating the proportion of HT genes in prokaryotic genomes, but the number of species examined at the time was limited, and gene annotation was relatively poor. Currently, tens of thousands of prokaryotic genomes have been published and gene annotation resources have improved. In the present study, HT gene prediction method was modified so that the estimate was robust to gene length, conducting a comprehensive search using 3017 representative prokaryotic genomes belonging to 1348 species. The result showed that an average of 13% (ranging from 0% to 30% across species) of protein-coding genes was predicted as being of horizontal origin. The proportion of the predicted HT genes per species was associated with the species' habitat, while a positive correlation between the proportion and genomic nucleotide frequency was also observed. Moreover, the functions of the predicted HT genes were inferred and compared according to two popular databases, the Clusters of Orthologous Groups and the Kyoto Encyclopedia of Genes and Genomes. As a result, both databases indicated that many of the widely transferred genes were involved in mobile genetic elements (transposons, phages, and plasmids) as expected. Notably, the present study predicted that six as-yet-uncharacterized genes were widely distributed HT genes, and therefore, will be interesting targets for evolutionary studies. Thus, this study demonstrates that a data-driven approach using massive sequence data may contribute to a broader understanding of HGT in prokaryotes.

Entities: Chemical Disease Gene Species

Keywords: Horizontal gene transfer; nucleotide composition; prokaryotic genomes

Year: 2018 PMID： 30546254 PMCID： PMC6287321 DOI： 10.1177/1176934318810785

Source DB: PubMed Journal: Evol Bioinform Online ISSN： 1176-9343 Impact factor: 1.625

Introduction

Horizontal gene transfer (HGT), or lateral gene transfer, is the phenomenon in which an organism gains genes from another unrelated organism. This phenomenon is often mediated by mobile genetic elements, such as plasmids and viruses, which incorporate the host’s nuclear DNA into their own genomes and carry it to another host.[1] HGT is influential particularly in unicellular organisms such as prokaryotes, because the genetic change caused by the insertion of foreign DNA is directly transmitted to progeny. If a transferred gene is not homologous to any genes in the population of recipient species, a novel allele or phenotype may arise more quickly than when caused by mutations in resident genes. As such, in prokaryotes, HGT has been considered as an important evolutionary driving force,[2,3] or in other words, an accelerator of evolution. Comparative genome studies have been conducted for the systematic prediction of genes transferred from another organism, namely horizontally transferred genes (HT genes).[3] In practice, molecular phylogenetic analysis is considered the most robust method for the detection of HGT, where topological inconsistency is evidence of the occurrence of HGT.[4] However, to expand the analysis from the individual gene level to the genome level, homologous sequences must be heuristically collected for each of the target genes. This is not easy at a time when genome data are accumulating at an extremely high speed: about three genomes/day for prokaryotes (about 70 genomes/day if contig data are also counted) in 2015 (https://www.ncbi.nlm.nih.gov/genome/microbes/). In addition, orphan genes with poor homologs in genome databases are not amenable to HT gene prediction. Although orphan genes may be considered to be of horizontal origin in the context of pan-genome,[5] the scope of a pan-genomic study is restricted to a specific lineage and is not applicable to comparing gene flow among distantly related species. Another approach for HT gene prediction, particularly at the whole genome level, involves a DNA sequence scan that computes nucleotide composition such as codon usage.[6,7] In prokaryotes, the nucleotide frequency is generally homogeneous across the entire genome, where regions of abnormal nucleotide composition often derive from other organisms. For example, codon usages have been examined in protein-coding genes in Escherichia coli,[7,8] and on the basis of these compositions, genes can be classified into three groups: normal genes, highly expressed (HE) genes, and HT genes. A merit of the nucleotide composition method is its speed. More specifically, the prediction of HT genes is applied to individual genomes, without any comparison to homologous sequences in other genomes. Comprehensive searches using the nucleotide composition method suggest that a substantial amount of genes have arisen from HGT in many prokaryotic species.[9-11] On the other hand, despite the ease of using this approach, there are some ambiguities in the theoretical basis for the nucleotide composition method, one of which is statistical validation. Intuitively, the statistics for measuring nucleotide composition bias should be influenced by sequence length, but the effect has not sufficiently been taken into account in previous methods. One solution is to remove shorter (or longer) genes in analysis[8,12] but, instead, a number of the genes are never examined and the cutoff criterion is unclear. Previously, based on statistical testing, the nucleotide composition method estimated that about 12% of protein-coding genes per genome were of horizontal origin, and functional annotation using a gene database[13] quantitatively suggested that pathogenicity-related genes were frequently transferred.[10] However, the number of genomes examined and the content of the database were limited when this study was published. Recently, genome sequencing has been achieved at faster rates using high-throughput DNA sequencing,[14] and as a result, gene annotation databases have expanded.[15,16] Therefore, it has now become possible for nearly a thousand genomes to be compared in the context of gene gain and loss.[17] The aim of the present study is to find as-yet-recognized HT genes in prokaryotic genomes using the currently available data, in parallel with the elucidation of overall tendency in HGT. In particular, widely transferred genes among taxa were the primary focus, rather than taxon-specific transferred genes. The HT gene candidates were predicted using a novel nucleotide composition analysis method unaffected by variations in gene length. In this study, more than 3000 representative prokaryotic genomes were surveyed to provide a basis for understanding gene flow among prokaryotes from a variety of environments.

Materials and Methods

Prokaryotic genome sequences

The representative genomic sequence data were downloaded from the ftp site of the National Center for Biotechnology Information (NCBI) according to the list as of March 14, 2014 (ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_representative_genomes.txt). The sequence data were originally composed of 3578 genomes, and the valid genomes were selected using the four following criteria: (1) the scientific name is given (not Candidatus species); (2) genome sequence is not fragmented; (3) gap region is small (<5% of the genome); and (4) protein-coding regions are predicted. Finally, a total of 3017 genomes were selected for HT gene prediction. Based on the taxonomical information from NCBI, these genomes were summarized into 1348 species, which of “sp.” in the same genus were clustered into a single group for convenience, and 661 genera. The genomes examined are listed in Supplementary Table 1.

Calculation of HT gene index

To predict HT genes, an index was computed as an indicator of the frequency bias of the adjoining codons in protein-coding genes (the software can be freely downloaded at https://github.com/yjnkmr/hgt). This index was derived from an output probability of the gene sequence based on a Markov chain model. First, for each genome, the transition matrix, {p}, was computed using all of the protein-coding sequences. In the matrix, p(c) (m, n = 1, 2,…, 64) is defined as a conditional probability for when codon m appears at a codon position, codon n appears at the next position. For example, p(TTT|AAA) represents a conditional probability that when “AAA” appears at a codon position “TTT” appears at the next position. When m is fixed, p(c) constitutes the probability vector: Using the transition matrix, {p}, the output probability for the gene sequence is represented as follows: Here, P0 is an initial probability for the first codon, p is the p assigned from the two observed codons at the ith and i + 1th positions, and L is the gene length (ie, the number of codons) excluding the first codon and stop codon. P0 was computed from the overall codon frequency in the target genome. This kind of representation using a Markov chain of codons or nucleotide tuples has been traditionally applied in gene-finding algorithms.[18,19] The basic idea in this study was to apply the output probability, P, for the prediction of HT genes. Since the magnitude of P is dependent on gene length L, the geometric mean was used like the codon adaptation index (CAI).[20] Furthermore, the logarithm was calculated: As far as the full-length coding sequences are concerned, the first codon is one of the start codons (eg, “ATG” [triplet of nucleotides]), and hence P0 is likely to deviate from the overall codon frequency. Therefore, the simplified formula without P0 was finally used: Under the null hypothesis, where each gene sequence is the output from the transition matrix, {p}, the expected value of I does not depend on gene length because the index is normalized by L. However, deviations in I should depend on gene length, where I values for shorter genes will be distributed with larger deviations. To validate this effect, a Monte Carlo simulation was performed. The expected distributions of I were computed using the artificial sequences of L = 20, 23, 28, 34, 42, 54, 72, 100, 150, 240, 460, and 1200 codons. These codon sizes were chosen by taking into account an interval of (1/L)1/2. For each codon size, a total of 100,000 artificial sequences of L + 1 codons were generated from the initial probability, P0, and the transition matrix, {p}, and the Is for the artificial sequences were computed. As a result, the distribution obtained by Monte Carlo simulation was a bit heavy-tailed compared with the normal distribution, particularly with reference to shorter genes (Figure 1A), while the variance seemed to be proportional to 1/L. In the case of longer genes, the expected distribution seemed to converge to a normal distribution. Therefore, the tail of the standardized I was approximated using a t-distribution with the degree of freedom, L + a (Figure 1B), where a was fixed using the least squares method.

Figure 1.

Q-Q plots of simulated I values. The results of the simulation using the gene set from the K-12 MG1655 strain of Escherichia coli (accession number: U00096) are shown. (A) Quantiles of simulated I values compared with those of the standard normal distribution, N (0,1). Each of the plots indicates the quantile of I (Q0,05, Q0.01, or Q0.001) in each gene length, L. (B) Q-Q plot of I values (L = 20) for the standard normal distribution (black) and for the t-distribution (degree of freedom = L + 15 = 35) (red). The simulated I values were standardized for comparison to the standard normal distribution, the values in the third quadrant are plotted, and three bottom quantiles (Q0,05, Q0.01, or Q0.001) are shown as dashed lines. The HE genes, which include genes encoding chaperones, elongation factors, and ribosomal proteins, have specialized codon usages.[8,21] Therefore, it is possible that these genes could be predicted as artifacts. In this study, to correct the prediction of HGT, a transition matrix for HE gene sequences was also prepared for each of the 3017 genomes. First, prokaryotic HE genes were collected from the UniProt database,[22] with reference to Karlin and Mrazek’s gene list (Table 2 in Karlin and Mrazek[21]). Next, all gene sequences in the 3017 genomes were compared with the HE gene sequences using BLASTP (E-value <10−5),[23] and then the candidates obtained were further checked using the profile models constructed by HMMER3 (http://hmmer.org/). The transition matrix for only HE genes, {p}, was computed by the aforementioned formulae, and the transition matrix for all genes, {p}, was modified to {p} by subtracting the HE gene sequences from all gene sequences. Finally, two Is for each gene were computed from the independent transition matrices, {p} and {p}, respectively, and the genes having significantly small Is were collected as putative HT genes. The Monte Carlo simulations mentioned above were performed and statistical significance levels were set to 0.01 for both {p} and {p}, thereby, α = 0.01 × 0.01 = 0.0001. Since prokaryotic genomes have about 102-104 protein-coding genes (3 × 103 on average in the present study), the threshold of α = 0.0001 indicates that at most, one gene per genome may be falsely predicted by {p} and {p} simultaneously by chance.

Functional annotation of HT genes

The annotated protein sequences were downloaded from two public databases, the Clusters of Orthologous Groups (COG) from NCBI[16] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) from Kyoto University.[15] The sequences were already clustered into ortholog groups defined by the COG number (eg, COG0001) in the COG database, and the KEGG Ortholog (KO) number (eg, K00001) in the KEGG database, many of which are attributed to one or more specific functions. In the COG database, each COG number is classified into one or more of 25 upper categories (eg, COG0001 is classified as “coenzyme transport and metabolism”). There are originally 26 categories in the COG database (ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/static/lists/homeCOGs.html), but no COGs are classified into “nuclear structure” (one-letter COG category code = Y). Since the KO groups often contain redundant sequences derived from different strains of the same species, for each of the ortholog groups, 95% identical sequences were clustered using CD-HIT software[24] and representative sequences in the clusters were used. Regarding the functionally uncharacterized proteins, the domains or motifs were predicted using InterProScan.[25] The functions of genes in the 3017 prokaryotic genomes were then inferred using COGsoft[26] to the COG and KEGG protein sequences, according to the developer’s instruction (https://sourceforge.net/projects/cogtriangles/files/). In the KEGG annotation, many of the KO numbers were already mapped to COG numbers. Such a correspondence was accepted when the shared genes occupied at least 50% of the genes in either group. With reference to each of the attributed ortholog groups, the genes of horizontal origin were counted. To assess the count bias of HT genes attributed to a functional category defined in the COG or an ortholog group defined in the COG or KEGG, an indicator, named the g index, was defined as the following: Here, O is the observed count of genes and E is the expected count of genes, and suffixes x and n denote “HT” and “non-HT” respectively. Therefore, the term in parenthesis is equivalent to half of the G statistic used in a likelihood ratio test.[27] The expected count of genes was computed from the average proportion of HT or non-HT genes, respectively. In addition, when the observed count of HT genes, O, is larger/smaller than the expected count, E, then g becomes a positive/negative value because of c. When O is zero, O ln(O/E) is defined as zero. Thus, g is associated with a χ2 probability in a contingency table analysis, and at the same time, represents a bias in the observed count of HT genes when compared with the expected count (range, –∞ < g < ∞).

Phylogenetic analysis

For each of the functionally uncharacterized HT genes predicted in this study, the phylogenetic tree was constructed: first, protein sequences in the target ortholog group were collected according to the aforementioned COGsoft annotation. Here, to avoid redundant sampling from different strains of the same species, the sequences were collected from representative genomes of the species. Since the homologous gene set obtained contained many sequences, some of which were too short for alignment, the sequences were sorted by amino acid length and those under the first quantile (<25%) were filtered out. When archaeal sequences were included in the gene set, those were separately analyzed. The pairwise alignment was then constructed using MUSCLE[28] and a distance matrix was computed with Kimura’s correction.[29] To avoid violations by irregularly aligned sequences, the outlier sequences whose average distances to other homologous sequences were in the top 5 percentile (ie, those seemed to have much diverged from the others) were removed. After that, the distance matrix was computed again, and the phylogenetic tree was constructed by neighbor-joining method.[30]

Results

Proportion of HT genes predicted from 3017 prokaryotic genomes

For all of the 3017 representative prokaryotic genomes, putative HT genes were identified using the index, I (P < 10−4) (Supplementary Table 1). The proportion of HT genes in each genome ranged from 0% to 30%, with an average of 13.0% (Figure 2A). To avoid sampling bias due to the presence of redundant species (eg, the 3017 genomes examined include 217 strains of Salmonella enterica, but only one of Edwardsiella ictaluri), the 3017 genomes were summarized into 1348 species, and the gene counts for each of the species were averaged. As a result, the overall average proportion of HT genes was 13.1% across 1348 species (Figure 2B), which is close to that computed for the full 3017 genomes. Similarly, when 1348 species were summarized into 661 genera, the overall average was 13.4% (Figure 2C). The HT genes were not unimodally distributed around the average (Figure 3). Based on these results, the genomes examined can be split into two or more groups, namely those with low HT gene proportions (<12%) and those with high HT gene proportions (>12%).

Figure 2.

Figure 3.

A distribution of HT gene proportions predicted in 1348 species.

Proportion of HT genes in prokaryotic genomes, species, and genera. The proportions are arranged in descending order for genome (A), species (B), and genus (C), respectively. In each panel, the average proportion is shown in a dash line. Proportions of the genes attributed to at least an ortholog group based on the COG (D) and KEGG (E) databases. The proportions for all genes and HT genes are denoted as A and X, respectively. A distribution of HT gene proportions predicted in 1348 species. Seeing as 821 out of 1348 species had the information of habitat in the NCBI database, the proportions of HT genes were compared based on the species habitats: aquatic, host-associated, multiple, specialized, and terrestrial (Figure 4). Regarding the species from multiple and terrestrial habitats, the proportions of HT genes (14.6% and 14.2% in average, respectively) were larger than the overall average (13.1%). The proportion in species from aquatic habitats (13.4% in average) was close to the overall average. The species from specialized and host-associated habitats tended to have less HT genes (11.4% and 11.6% in average, respectively) than those in the other three habitats (Wilcoxon rank sum test: P < .05 with Bonferroni correction), while the gene proportion for species from host-associated habitats ranged widely across species (min = 0.5% and max = 29.8%). Similarly, the species examined were classified according to three properties (motility, oxygen requirement and temperature range), respectively, and the comparison showed higher proportions of HT genes in motile, aerobic, and mesophilic species, respectively (Figure 4). The proportion of HT genes correlated with genome guanine-cytosine (GC) content (correlation coefficient: r = 0.73) (Figure 5A). Genomes with lower GC content (ie, AT-rich genomes) had smaller proportions of HT genes, which corresponded to the aforementioned group of low HT gene proportions. The proportion of HT genes was also correlated to genome size (r = 0.46) (Figure 5B). However, the partial correlation was weak between genome size and HT gene proportion (r = 0.13), while that between GC content and the HT gene proportion was relatively strong (r = 0.64). The positive correlations between the GC content and the HT gene proportion were observed in all the species groups classified according to the habitat (Figure 6).

Figure 4.

Figure 5.

Correlation among the proportion of HT genes, genomic GC content, and genome size in 1348 species. Correlations between the proportion of HT genes, and either genomic GC content (A) or genome size (B).

Figure 6.

Correlation between the proportion of HT genes and genomic GC content in each of five habitats.

Relationships between the proportion of HT genes and four properties in 1348 species. The difference in proportion of HT genes between two categories from groups 1 and 2 (eg, “multiple” and “specialized” in habitat) was statistically significant (Wilcoxon rank sum test: P < .05 with Bonferroni correction). Correlation among the proportion of HT genes, genomic GC content, and genome size in 1348 species. Correlations between the proportion of HT genes, and either genomic GC content (A) or genome size (B). Correlation between the proportion of HT genes and genomic GC content in each of five habitats. The length of HT genes was compared with previously analyzed results regarding 135 genomes[10] (originally 114 genomes, 28 were later updated, and 7 were excluded) (Figure 7). In all the genomes examined, the median length of HT genes predicted using the previous method was smaller than that for all genes (Figure 7A). Contrastingly, in the goodness-of-fit test for codon usage, longer genes were preferentially predicted as the genes with abnormal codon frequency (Figure 7B). However, using the present method, the median lengths of HT genes differed from species to species, and the prediction of HT genes was independent of gene length (Figure 7C).

Figure 7.

HT gene length compared with all of the examined genes. The graph depicts the median gene length (amino acid residues) for all of the examined genes (x-axis) and predicted HT genes (y-axis). Each dot corresponds to one genome. The genomes with few putative HT genes (<3) are not used. The HT genes were (A) predicted from 135 genomes according to the previous method,[10] (B) predicted from 3017 genomes using a goodness-of-fit test (P < .01 for both the HE gene set and gene set containing all genes) with the G statistic for codon usage, and (C) predicted from 3017 genomes in the present study. In (C), the 135 genomes examined in the previous study are plotted in black.

Transferable gene functions

The functions of the genes in the 3017 genomes were inferred from the protein sequences annotated in the COG and KEGG databases using COGsoft (Figure 2D and E). A substantial percentage of genes, 80.9% with COG and 76.3% with KEGG, were attributed to at least one ortholog group. Out of the predicted HT genes, 66.3% and 60.1% were attributed to at least one ortholog group in the COG and KEGG, respectively. When averaged by species and genus, the results were similar: 80.8%/74.7% of all of the genes and 67.2%/59.0% of the HT genes were attributed to at least one ortholog group in the COG/KEGG at the species level, and 81.3%/74.3% of all of the genes and 67.4%/57.9% of the HT genes were attributed to at the genus level. At the species level, the HT gene proportion bias within each ortholog group was evaluated using the g index, a signed half of the G statistic. First, g indices were ranked within 25 functional COG categories to the genes assigned to those categories through COGsoft annotation. As a result, mobilome-related genes (one-letter COG category code = X) were overwhelmingly predicted as HT genes among 1348 species (Figure 8). Except for mobilome-related genes, three categories were frequently predicted as HT genes, namely, “replication, recombination, and repair” (L), “extracellular structures” (W), and “defense mechanisms” (V). These categories (L, W, and V) were also top three except for mobilome-related genes at the genus level (Supplementary Figure 1, top). On the other hand, genes in “translation, ribosomal structure, and biogenesis category” (J) were the least transferable as well as at the genus level.

Figure 8.

Function bias in putative HT genes. The COG categories are ranked by g index in descending order, and those between −2000 < g < 2000 are also shown.

Function bias in putative HT genes. The COG categories are ranked by g index in descending order, and those between −2000 < g < 2000 are also shown. Next, g indices of the COG ortholog group were ranked with the top 50 (Figure 9 and Supplementary Table 2). Many of the ranked ortholog groups were of transposase-coding genes classified as mobilome-related genes. Moreover, the groups involved in the viral life cycle or plasmid maintenance (COG0582, COG1961, COG4974, and COG3668) were also related to mobilome genes. Additionally, the ortholog groups involved in DNA restriction or modification (COG1403, COG0732, COG0270, COG0863, and COG0286), transcriptional regulation (COG1396, COG2207, COG3311, and COG0583), the secretion system (COG3505, COG4104 and COG3843), and the pilus or cell surface (COG0438, COG2244, COG1835, COG3539, and COG3307) were predicted as widely transferred genes. Moreover, the list included four uncharacterized genes (COG3209, “uncharacterized conserved protein RhaS, contains 28 RHS repeats”; COG1479, “uncharacterized conserved protein, contains ParB-like and HNH nuclease domains”; COG3291, “PKD repeat”; and COG3791, “uncharacterized conserved protein”), one of which was categorized as “general function prediction only” (R), and the others categorized as “function unknown” (S) in the COG database. In comprehensive phylogenetic analysis, disorders in branching pattern were frequently observed in these genes compared with those of the least transferable category, J (Figure 10 and Supplementary Figures 2 and 3). As a result of InterPro analysis (Figure 11), more than half of COG3209 and COG3291 genes included RHS repeat (IPR022385) and immunoglobulin-like fold (IPR013783) domains, respectively, and most of COG1479 genes included domain of unknown function DUF262 (IPR004919). Regarding COG3791, almost all of the genes were attributed to Mss4-like superfamily (IPR011057) and glutathione-dependent formaldehyde-activating enzyme/centromere protein V (IPR006913).

Figure 9.

Figure 10.

Comparison of phylogenetic trees between transferable and non-transferable genes. (Left) COG1479 (uncharacterized conserved protein, contains ParB-like and HNH nuclease domains: COG category code = S). (Right) COG0080 (ribosomal protein L11: COG category code = J). Each of the edges denotes an individual gene and is colored according to the phylum of the bacterium carrying the gene. The predicted HT genes are plotted by cross, otherwise by circle.

Figure 11.

InterPro matches in six uncharacterized HT genes. Top four (COG3209, COG1479, COG3291, and COG3791) and bottom two (K08998 and K07062) were predicted from the COG and KEGG databases, respectively.

Top 50 transferable genes based on COG/KEGG annotations. Uncharacterized genes are shown in red (COG category code = R) or orange (COG category code = S) based on the COG annotation (left), and blue based on the KEGG annotation (right). The corresponding gene groups between the COG/KEGG annotations are linked by lines. Comparison of phylogenetic trees between transferable and non-transferable genes. (Left) COG1479 (uncharacterized conserved protein, contains ParB-like and HNH nuclease domains: COG category code = S). (Right) COG0080 (ribosomal protein L11: COG category code = J). Each of the edges denotes an individual gene and is colored according to the phylum of the bacterium carrying the gene. The predicted HT genes are plotted by cross, otherwise by circle. InterPro matches in six uncharacterized HT genes. Top four (COG3209, COG1479, COG3291, and COG3791) and bottom two (K08998 and K07062) were predicted from the COG and KEGG databases, respectively. The top 10 ortholog groups in the above-mentioned three categories containing frequently transferred genes, that is, categories L, W, and V, are shown in Supplementary Table 3. Although the genes in category L included DNA modification genes, many were probably involved in viral life cycle (see section “Discussion”). Almost all of the genes in category W encoded a pilus-related protein. Seven out of the top 10 genes in category V were DNA restriction or modification genes, one was a plasmid maintenance gene, and two were related to bacterial defense systems against viruses, such as the clusters of regularly interspaced short palindromic repeats (CRISPR) system. Similarly, g indices were computed and ranked with reference to KO groups (Figure 9 and Supplementary Table 4). The results revealed that most of the frequently transferred genes were common to those based on the COG annotations (although some of the descriptions of corresponding genes were slightly different between the databases). The most transferable ortholog group (K07497) was of a transposase gene, congruent with the top in the COG database (COG2801). The second most transferable group, K04763, was also a counterpart of COG0582 (integrase). Although COG0582 was split into five groups in the KEGG database (Supplementary Table 2), the K14059 group was also predicted to include frequently transferred genes (Figure 9). Two groups (K08998 and K07062: “uncharacterized protein”) were uncharacterized in the KEGG annotation. The COG counterparts of K08998 and K07062 were COG0759 (membrane-anchored protein YidD, putative component of membrane protein insertase Oxa1/YidC/SpoIIIJ) categorized into “cell wall/membrane/envelope biogenesis” (M), and COG1487 (predicted nucleic acid-binding protein, contains PIN domain) categorized into “general function prediction only” (R), respectively. InterPro analysis also supported the functional annotations about K08998 (COG0759) and K07062 (COG1487) genes (Figure 11).

Discussion

In the present study, putative HT genes were identified across more than 3000 representative prokaryotic genomes using a novel method based on nucleotide composition (ie, codon usage). A merit of the nucleotide composition method is that gene prediction can be performed for a single genome, in contrast to the phylogenetic method that requires a comprehensive comparison across all of the related genomes. Therefore, it is possible in the future that the prediction of HT genes can be performed automatically following genome sequencing, and the maintenance of result data is easy. However, nucleotide composition is an indirect indicator of HGT, although phylogenetic analyses can directory prove it. For example, it may be difficult to detect HGT between closely related species or HGT in AT-rich genomes by nucleotide composition methods (as discussed below). As a fundamental problem in statistics, in particular, nucleotide composition-based methods are affected by sequence length. In the case of goodness-of-fit test for codon usage, longer genes were preferentially predicted as outliers (ie, HT genes). Moreover, the previously described method based on nucleotide composition[10] tended to predict shorter genes as HT genes. Here, one may think that shorter genes might be actually preferred in HGT, because mobile genetic elements, such as viruses or plasmids, cannot easily carry long DNA under the constraints defined by their compact structures. This hypothesis is worthy of being verified, but first, stochastic effects in the mathematical model need to be carefully removed. The solution in this study is to represent the distribution parameter by a function of gene length. To this aim, the index, I, was developed as represented by a simple formula, which is statistically easy to deal with. The expected heavy-tail of standardized I was approximated by t-distribution and the degree of freedom was determined depending on gene length. Thus, the statistical significance of I was evenly calculated for genes of any length. Actually, the prediction results for HT genes in this study seem to be unbiased by gene length, and shorter genes were not significantly preferred in HGT. In addition, it should be noted that 11.1% of protein-coding genes in the 3017 genomes examined were relatively short (<100 codons). As a result of this study, it has become possible to analyze these genes. In this study, the correlations between HGT and the properties in prokaryotes were examined. Regarding the habitat, the species in specialized and host-associated habitats tended to have lower proportions of HT genes, suggesting that HGT is rare in a limited or closed environment. This result can be explained by the lower chance of gene flow between different species in such an environment. Conversely, the species in multiple, terrestrial habitats were relatively rich in HT genes, which may be due to the higher chance of gene gain from a variety of organisms in environments. This might be also the case in motility of host species: motile species were richer in HT genes than non-motile species (Figure 4). Since the five attributed habitats include a wide range of environments (eg, “aquatic” includes freshwater and seawater), further analysis will be necessary for understanding a detailed correlation between habitat and HGT. Here, it should be noted that despite the least proportion of HT genes among the habitats, the estimates in host-associated habitats varied depending on species (0.5%-29.8%). This observation may be correlated with frequent HGT in pathogenetic or symbiotic prokaryotes.[10,31] Actually, some of the species in host-associated habitats are pathogens (eg, Neisseria) or symbionts (eg, Rhizobium) to the host, and these showed high proportions of HT genes (>~20%) (Supplementary Table 1). On the other hand, HGT may be rare in the endosymbiotic species: no, or almost no, HT genes were detected in Buchnera and Blattabacterium in this study. There was a positive correlation between HT gene proportion and genomic GC content, indicating that the two modes of HT gene proportion, namely the groups of low/high HT gene proportions, corresponded to AT-rich/GC-rich genomes, respectively. This observation can be explained by two possibilities. The first is that HT genes are often AT-rich, similar to intrinsic genes in AT-rich genomes, and hence might be immune to detection using nucleotide composition bias. The second possibility, from a genome evolution perspective, is that prokaryotes with AT-rich genome have rarely undergone gene gains. Endosymbiotic or obligately parasitic prokaryotes in host-associated habitats often have compact and AT-rich genomes,[32] and such compactness is due to decreasing the number of genes by deletion, rather than increasing the number of genes by duplication or HGT. This may be the case for Buchnera and Blattabacterium as mentioned above. The two scenarios that could account for the correlation between the HT gene proportion and genomics GC content, one being a limitation of the method and the other being biologically reasonable, are not mutually exclusive. Seeing as positive correlations between the HT gene proportion and genomic GC content were also observed in the species from other habitats (Figure 6), the first scenario is more strongly represented in the present study than the second scenario. Thus, some of the proportions of HT genes in AT-rich genomes might be underestimated, and additional studies will be required to solve this problem. In previous research, the putative HT genes were rich in genes with unknown functions.[10] It could be considered that such a result was caused by the paucity of the database contents. As the gene annotation databases have been updated since the previous study was published, HT gene annotation was improved in the present study. The index, g, was used to measure bias in the functionally categorized HT genes, which is also influenced by the sample size as with χ2 statistic. The value of g may be more sensitive to the genes common to a large number of taxa than a small number of taxa. For this reason, the g index is suitable for detecting the overall tendency for HGT: it can detect widely distributed HT genes that cross among relatively broad taxa, rather than limited HT genes that are frequently exchanged within specific taxa. In this study, mobilome genes were overwhelmingly predicted as widely transferred genes, which is biologically reasonable. The genes involved in translation, such as ribosomal proteins (COG category J), were the least transferable. This result seems methodologically obvious, because many of the genes in this category were used in the HE gene model. However, when the I for only the all-minus-HE gene model was computed, g index of category J genes was still negative and the fourth lowest among all of the categories (Supplementary Figure 1, bottom). Therefore, in the present method, the I calculation using the HE gene model may be a dispensable step. To avoid an annotation bias arising from the database used, both the COG data and KEGG data were used. The resource genomes, ortholog grouping method, and repertoire of annotated genes are different between these two databases. In particular, the ortholog groups in KEGG do not always correspond to those in COG on one-to-one level (Figure 9). For example, COG3209 (uncharacterized conserved protein RhaS, contains 28 RHS repeats) has no counterpart in KEGG, and the single group, COG0582 (integrase), corresponds to five groups in KEGG, such as K04763 and K14059. In general, single COG group is divided into multiple groups in the KEGG database, and the number of species having a COG group is seemingly larger than the number having counterparts in the KEGG database. Thus, proportions of HT genes and g indices should be different between COG and KEGG groups. Nevertheless, similar results were obtained for COG and KEGG groups. For example, mobilome genes were frequently observed using both databases. This result itself is not surprising, but it should be noted that COG2801 (K07497) was detected as having the most transferable genes in both the COG and KEGG databases. According to the g index, this gene is the most widely transferred gene among prokaryotes. The second most transferable gene based on the COG database is of COG0582, which corresponds to K04763 (integrase/recombinase XerD) and possibly K14059 (integrase) in the KEGG database. Although XerD is a recombinase required for sister chromosomal segregation in prokaryotic DNA replication, XerD and its homolog, XerC,[33] are homologous to phage integrase genes in the same family.[34,35] Therefore, many of the genes in the COG0582 group may be derived from mobile element genes. Note that COG4974 is attributed to XerD in the COG database and K03733 is attributed to “integrase/recombinase XerC” in the KEGG database (Supplementary Tables 2 and 4), implying that the classification of this integrase family is confusing between the databases. If the counts for xerC and xerD genes that are required for host DNA replication are removed from the COG0582 count, the g index will be modified. The HT genes classified into category L (Supplementary Table 3), except for DNA modification genes, may be involved in the viral life cycle. For example, COG1484 is attributed to DnaC that is required for prokaryotic DNA replication, but the homologs are found also in phages.[36] In contrast, the HT genes in category V are involved in the host’s defense system against phages, such as genes of restriction enzyme or of CRISPR/Cas system. The genes involved in the restriction-modification system[37] and CRISPR[38,39] are considered to be frequently transferred among prokaryotic genomes, respectively. It has also been reported that pilus-related genes in category W have been transferred as pathogenicity-related genes.[40] Furthermore, secretion system genes were predicted as frequently transferred genes in both the COG and KEGG database. The secretion system is often involved in the pathogenesis of bacteria, and the responsible genes are located tandemly as a large cluster that is subject to HGT.[41,42] Thus, as a whole, widely transferred genes detected in this study were those reported as HT genes in the previous case studies, suggesting the usefulness of the nucleotide composition method in treating a huge amount of genomic data. Focusing on uncharacterized HT genes in the COG and KEGG databases, four gene groups (COG3209, COG1479, COG3291, and COG3791) were classified into “general function prediction only” (COG category code = R) or “function unknown” (S) according to the COG annotation. By adding two gene groups of “uncharacterized protein” (K08998 and K07062) from the KEGG annotation, a total of six were obtained as functionally ambiguous HT gene groups despite being distributed among more than 300 species. With reference to COG3209 (uncharacterized conserved protein RhaS, contains 28 RHS repeats), the genes encoded in E. coli have been suggested to be horizontally transferred from another organism.[43] Recently, RHS repeat-containing genes are reported to be involved in toxins against competitors[44]; therefore, the genes in COG3209 could be considered as defense system genes. The functions of COG1479 (uncharacterized conserved protein, contains ParB-like and HNH nuclease domains), COG3291 (PKD repeat), and COG3791 (uncharacterized conserved protein) have yet to be examined. According to InterPro analysis, COG3791 genes have a domain of glutathione-dependent formaldehyde-activating enzyme (IPR006913); therefore, these genes might be related to formaldehyde detoxification.[45] The KO group, K08998, corresponds to COG0759 (membrane-anchored protein YidD) of category M in the COG database that has previously been reported to be involved in the protein insertion process.[46] K07062 corresponds to COG1487 and is considered a toxic protein.[47] As a whole, it has to be said that the evolutionary significance of these six gene groups has not been fully realized. Conversely, these genes might be good targets for evolutionary studies in the context of HGT, providing an example of data-driven approaches from massive sequence data.[48] Of course, there may still be a sampling bias depending on the database status at the time; however, the content of the database will further increase and the annotation level will be further refined in the future. When ortholog grouping is performed using all available genomes, including those recently sequenced, novel ortholog groups common to many species may be found, whereby widely transferred novel HT genes might also be found. In this study, the proportion of genes attributed to at least an ortholog group per species was 81% for the COG database and 75% for the KEGG database. However, for HT genes, the proportions decreased (67% and 59%, respectively), indicating that the genes that were not attributed to any ortholog group were often predicted as HT genes. In fact, 23% of the genes that were not attributed to any ortholog group were predicted as HT genes, while only 10% of the genes attributed to at least an ortholog group were predicted as HT genes (13% for 1348 species in average). Unless there is some reason why unknown genes are prone to be horizontally transferred, this observed difference in results may be caused by artificial processes such as in gene annotation. A clue is that the genes which were not attributed to any ortholog group were shorter than the genes attributed to at least an ortholog group (Supplementary Figure 4). Since a BLAST-based method was used to assign the ortholog groups, it is likely that shorter gene sequences have fewer significant matches by chance due to a small alignment score. However, this reasoning seems insufficient to explain the relationship between two observations—(1) the shortness of the genes attributed to no ortholog group and (2) the richness of HT genes predicted in the genes attributed to no ortholog group—because the HT gene prediction method developed in this study is not affected by sequence length. One possibility is that some of the predicted genes in the original genome project may be pseudogenes or falsely detected genes. In particular, frameshifts caused by nucleotide insertion/deletion can shorten the gene by the emergence of stop codons in the open reading frame and disturb the codon frequency, resulting in a small I value. This can explain both the above-mentioned observations; however, it is not currently easy to evaluate the frequency of frameshifts, because those depend on the status of genome sequencing. For example, the observed insertion/deletion might be an artifact caused by DNA sequencing errors, or the insertion/deletion may have actually occurred during cultivation of the strain. For a more thorough understanding of HGT, the next challenge will be to clarify the nature of genes with no attributed function.

Conclusions

In the present study, a novel method was developed for measuring the frequency bias of the adjoining codons that allows for the prediction of HT genes. Since this method is statistically robust against variations in gene length, it is applicable to all protein-coding genes, including fairly short ones. In this study, at the maximum scale possible for HT gene prediction, using more than 3000 prokaryote genomes, an average of 13% of the genes per genome were predicted to be of horizontal origin. The result revealed that the proportion of HT genes correlated with the species’ habitat, although the influence of genomic GC content was not negligible. Moreover, the functional categorization using the COG and KEGG databases showed that mobilome-related genes, particularly those in COG2801 and COG0582, were the most widely distributed HT genes among prokaryotic taxa. Except for the mobilome-related genes, the genes involved in cell defense (restriction-modification and CRISPR), the secretion system, or the cell surface (pilus, lipopolysaccharide) were predicted to be widely distributed HT genes. In addition, the genes attributed to COG3209, COG1479, COG3291, COG3791, COG0759 (K08998), and COG1487 (K07062) were widely transferred yet functionally uncharacterized; therefore, these genes may be interesting targets for future evolutionary studies. Although the functional classification of HT genes predicted in the present study depends on the status of the gene databases used, the future accumulation of genome sequence data and improvement of annotation may lead to the discovery of evolutionarily important HT genes by data-driven approaches.

45 in total

Review 1. Lateral gene transfer and the nature of bacterial innovation.

Authors: H Ochman; J G Lawrence; E A Groisman
Journal: Nature Date: 2000-05-18 Impact factor: 49.962

2. Recombination and horizontal transfer of nodulation and ACC deaminase (acdS) genes within Alpha- and Betaproteobacteria nodulating legumes of the Cape Fynbos biome.

Authors: Benny Lemaire; Jannick Van Cauwenberghe; Samson Chimphango; Charles Stirton; Olivier Honnay; Erik Smets; A Muthama Muasya
Journal: FEMS Microbiol Ecol Date: 2015-10-02 Impact factor: 4.194

3. Isolation of Neisseria gonorrhoeae mutants that show enhanced trafficking across polarized T84 epithelial monolayers.

Authors: S Hopper; J S Wilbur; B L Vasquez; J Larson; S Clary; I J Mehr; H S Seifert; M So
Journal: Infect Immun Date: 2000-02 Impact factor: 3.441

4. Data-driven hypotheses.

Authors: Paul van Helden
Journal: EMBO Rep Date: 2012-12-21 Impact factor: 8.807

5. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Authors: N Saitou; M Nei
Journal: Mol Biol Evol Date: 1987-07 Impact factor: 16.240

6. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications.

Authors: P M Sharp; W H Li
Journal: Nucleic Acids Res Date: 1987-02-11 Impact factor: 16.971

7. Role for Escherichia coli YidD in membrane protein insertion.

Authors: Zhong Yu; Mariëlle Lavèn; Mirjam Klepsch; Jan-Willem de Gier; Wilbert Bitter; Peter van Ulsen; Joen Luirink
Journal: J Bacteriol Date: 2011-07-29 Impact factor: 3.490

8. Detection of short protein coding regions within the cyanobacterium genome: application of the hidden Markov model.

Authors: T Yada; M Hirosawa
Journal: DNA Res Date: 1996-12-31 Impact factor: 4.458

9. Plasmids derived from Gifsy-1/Gifsy-2, lambdoid prophages contributing to the virulence of Salmonella enterica serovar Typhimurium: implications for the evolution of replication initiation proteins of lambdoid phages and enterobacteria.

Authors: Bartosz Słomiński; Joanna Całkiewicz; Piotr Golec; Grzegorz Węgrzyn; Borys Wróbel
Journal: Microbiology (Reading) Date: 2007-06 Impact factor: 2.777

10. The emergence and fate of horizontally acquired genes in Escherichia coli.

Authors: Mark W J van Passel; Pradeep Reddy Marri; Howard Ochman
Journal: PLoS Comput Biol Date: 2008-04-11 Impact factor: 4.475

1 in total

1. ARTS 2.0: feature updates and expansion of the Antibiotic Resistant Target Seeker for comparative genome mining.

Authors: Mehmet Direnç Mungan; Mohammad Alanjary; Kai Blin; Tilmann Weber; Marnix H Medema; Nadine Ziemert
Journal: Nucleic Acids Res Date: 2020-07-02 Impact factor: 16.971

1 in total