Qiyun Zhu, Michael Kosoy, Katharina Dittmar.
Abstract
BACKGROUND: First-pass methods based on BLAST match are commonly used as an initial step to separate the different phylogenetic histories of genes in microbial genomes, and target putative horizontal gene transfer (HGT) events. This will continue to be necessary given the rapid growth of genomic data and the technical difficulties in conducting large-scale explicit phylogenetic analyses. However, these methods often produce misleading results due to their inability to resolve indirect phylogenetic links and their vulnerability to stochastic events.
Year: 2014 PMID: 25159222 PMCID: PMC4155097 DOI: 10.1186/1471-2164-15-717
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Flowchart illustrating the procedures of the HGTector method. Parallelograms indicate input data or information, rectangles indicate processing steps, diamonds indicate decisions, and rounded rectangles indicate the start and end of the workflow. Graphic illustrations of a hypothetical phylogenetic tree, BLAST hit table and fingerprint are drawn on the right side of their corresponding steps.
Figure 2. Tree topology and fingerprint (distributions of BLAST hit weights) of tests on simulated genomes. One representative test using either the idealized tree topology (A) or the randomized tree topology (B) is depicted (see text). The kernel density function of close weight distribution for both topologies (C, D) shows the distribution of all genes in the input genomes in black, and that of actual positive genes (derived from HGT events from distal group to self group) in red. Genes involved in other simulated evolutionary events are shown in different colors in the lower panels. Locations of these genes in the general distribution are indicated as a rug below each plot. The scales of x-axes between the upper and the lower panels are identical. Cutoffs computed by the program distinguishing the atypical region from the typical region are represented in dashed (for relaxed criterion) and dotted (for conservative criterion) lines.
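The kernel density view described in this caption can be illustrated with a small, standard-library-only sketch. It is not HGTector's code: the weights, the bandwidth, and the helper names `gaussian_kde` and `first_valley` are all hypothetical, and placing a cutoff at the first density valley is only one plausible assumption about how an atypical region could be separated from a typical one.

```python
import math
from typing import Callable, Optional, Sequence


def gaussian_kde(samples: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        # Sum one Gaussian bump per sample, then normalize.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def first_valley(density: Callable[[float], float], lo: float, hi: float,
                 steps: int = 200) -> Optional[float]:
    """Return the grid point of the first local minimum of `density` in [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [density(x) for x in xs]
    for i in range(1, steps):
        if ys[i] < ys[i - 1] and ys[i] <= ys[i + 1]:
            return xs[i]
    return None  # no interior valley found (e.g. a unimodal distribution)


# Hypothetical close weights: a small low-weight cluster and a larger typical cluster.
weights = [0.2, 0.4, 0.5, 0.7, 9.0, 9.5, 10.0, 10.2, 10.8, 11.0]
density = gaussian_kde(weights, bandwidth=1.0)  # bandwidth chosen arbitrarily
cutoff = first_valley(density, min(weights), max(weights))
atypical = [w for w in weights if cutoff is not None and w < cutoff]
print(cutoff, atypical)
```

With a bimodal input like this one, the valley falls between the two clusters and the low-weight genes land on the atypical side. A unimodal input returns `None`, in which case no cutoff is proposed.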
Figure 3. Fingerprint of seven genomes. BLAST hit weights of all protein-coding genes in the seven Rickettsia genomes are plotted. (A-C) Kernel density functions of the self, close and distal weights. The x-axis represents the weight of each gene. The y-axis represents the probability density of genes with the corresponding weight in the genomes. In this example, rule 1 (close weight < cutoff) and rule 2 (distal weight ≥ cutoff) were applied. The close and distal cutoffs computed under the conservative criterion are indicated by dashed lines. The values of the cutoffs are denoted in each panel. (D) A scatter plot of the distal weight against the close weight, showing the clustering pattern of the genes. Each dot represents one gene. Genes predicted to be HGT-derived are framed by a red rectangle. (E) A zoom-in view of the left part of the previous plot. Genes that fall within the atypical region in the close weight distribution are colored by a blue-red color scheme based on the density-based silhouette (dbs), a measure of confidence that this gene belongs to the atypical cluster of genes (red = high confidence). The close cutoff used in the subsequent analyses is indicated by a dashed line.
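The two rules named in this caption amount to a simple filter over per-gene weights. The sketch below shows that filter only. The gene names, weights, cutoff values and the helper `predict_hgt` are hypothetical and are not taken from HGTector itself.

```python
from typing import Dict, List, Tuple


def predict_hgt(weights: Dict[str, Tuple[float, float]],
                close_cutoff: float,
                distal_cutoff: float) -> List[str]:
    """Return genes satisfying rule 1 (close weight < cutoff)
    and rule 2 (distal weight >= cutoff), sorted by name."""
    hits = []
    for gene, (close_w, distal_w) in weights.items():
        rule1 = close_w < close_cutoff      # atypically low close weight
        rule2 = distal_w >= distal_cutoff   # distal weight at or above its cutoff
        if rule1 and rule2:
            hits.append(gene)
    return sorted(hits)


# Hypothetical (close weight, distal weight) pairs and cutoffs.
example = {
    "geneA": (0.5, 3.0),
    "geneB": (12.0, 0.2),
    "geneC": (0.8, 0.1),
    "geneD": (1.9, 2.5),
}
candidates = predict_hgt(example, close_cutoff=2.0, distal_cutoff=1.0)
print(candidates)  # ['geneA', 'geneD']
```

In panel D terms, the selected genes are those falling in the rectangle to the left of the close cutoff and at or above the distal cutoff.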
Real genomic datasets tested in this study
| Category | | | No. of genes | Date of BLAST | Max. no. of hits | List of input genomes (organism name and NCBI accn. no.) |
|---|---|---|---|---|---|---|
| Alphaproteobacteria | | Rickettsiales | 8484 | Jan. 2013 | 200 | |
| Firmicutes | | Bacilli | 11906 | Nov. 2013 | 100 | |
| Epsilonproteobacteria | | Campylobacterales | 10531 | Mar. 2013, Nov. 2013² | 200 | |
| Gammaproteobacteria | | Enterobacteriales | 19013 | Mar. 2013 | 200 | |
| Actinobacteria | | Mycobacterium | 3830 | Oct. 2013 | 100 | |
| Unicellular red algae | | Eukaryota | 7174 | Dec. 2013 | 50 | |
| Higher animal | | Animalia | 225164 | Nov. 2013 | 1000 | |

¹ The genomes used in this study are identical to those used in [37].
² Two independent analyses were conducted on different dates, and similar outcomes were obtained. The more recent result was reported.
³ The genome used in this study is identical to that used in [79].
⁴ For genes with multiple isoforms, the longest CDS was extracted using an in-house Perl script and used for the analysis.
Figure 4. Comparison of the performance of HGTector and the conventional BLAST-based method on simulated genomes. The methods were tested on the simulated genomic data under an idealized (A) or randomized (B) tree topology. “Con” and “Rel” represent conservative and relaxed cutoffs in the HGTector analysis. “C = 0” and “D > C” are two criteria in the conventional BLAST-based method. Each experimental group is composed of 100 tests. The distribution of results in terms of precision (red) and recall (blue) is depicted by box plots. The mean value of each group is labeled above the corresponding column.
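Precision and recall, the two measures plotted in this figure, can be computed for a single simulated test as sketched below. The gene identifiers and the helper `precision_recall` are hypothetical; only the standard definitions of the two measures are assumed.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision = TP / predicted positives; recall = TP / actual positives."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical outcome of one test: genes flagged by a method vs. genes
# that were actually transferred in the simulation.
predicted = {"g1", "g2", "g3", "g4"}
actual = {"g2", "g3", "g4", "g5", "g6"}
p, r = precision_recall(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Repeating this over the 100 tests in a group would give the per-group distributions that the box plots summarize.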
Summary of genes predicted to be horizontally acquired in seven genomes
| Abbreviation | Size of chromosome (Mb) | Number of chromosomal protein-coding genes | Number of predicted HGT-derived genes | Percentage of predicted HGT-derived genes |
|---|---|---|---|---|
| | 1.23 | 1256 | 76 | 6.05% |
| | 1.49 | 1400 | 256 | 18.29% |
| | 1.36 | 968 | 93 | 9.61% |
| | 1.28 | 1114 | 72 | 6.46% |
| | 1.26 | 1342 | 98 | 7.30% |
| | 1.28 | 1030 | 78 | 7.57% |
| | 1.27 | 1374 | 127 | 9.24% |
| Mean | 1.31 | 1212 | 114.3 | 9.43% |
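The percentage column and the final row of this table can be rechecked from the other columns. The sketch below assumes each percentage is the number of predicted HGT-derived genes divided by the number of protein-coding genes, and that the last row holds per-column means of the seven rows above it. Both assumptions reproduce the tabulated values.

```python
# (chromosome size in Mb, protein-coding genes, predicted HGT-derived genes)
# for the seven genome rows, copied from the table.
rows = [
    (1.23, 1256, 76),
    (1.49, 1400, 256),
    (1.36, 968, 93),
    (1.28, 1114, 72),
    (1.26, 1342, 98),
    (1.28, 1030, 78),
    (1.27, 1374, 127),
]

# Per-genome percentage of predicted HGT-derived genes.
percentages = [round(100 * hgt / genes, 2) for _, genes, hgt in rows]

# Column means across the seven genomes.
n = len(rows)
mean_size = round(sum(size for size, _, _ in rows) / n, 2)
mean_genes = sum(genes for _, genes, _ in rows) / n
mean_hgt = sum(hgt for _, _, hgt in rows) / n
mean_pct = round(100 * mean_hgt / mean_genes, 2)

print(percentages)
print(mean_size, mean_genes, round(mean_hgt, 1), mean_pct)
```

The gene counts sum to 8484, which matches the number of genes listed for the Rickettsiales dataset in the table of real genomic datasets.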
Figure 5. Predicted HGTs by multiple methods mapped onto the genome. A circular view of the whole chromosome of R. felis, with genomic islands (GIs) predicted by IslandViewer and GIST indicated by boxes, and putative HGT-derived genes predicted by other methods indicated by arrowheads.