Kujin Tang, Yang Young Lu, Fengzhu Sun.
Abstract
Horizontal gene transfer (HGT) plays an important role in the evolution of microbial organisms, including bacteria. Alignment-free methods based on the compositional information of a single genome have been used to detect HGT. Currently, Manhattan and Euclidean distances based on tetranucleotide frequencies are the most commonly used alignment-free dissimilarity measures for detecting HGT. By testing on simulated bacterial sequences and real data sets with known horizontally transferred genomic regions, we found that more advanced alignment-free dissimilarity measures such as CVTree and $d_2^*$, which take into account the background Markov sequences, can solve HGT detection problems with significantly improved performance. We also studied the influence of different factors, such as the evolutionary distance between host and donor sequences, the size of the sliding window, and the host genome composition, on the performance of alignment-free methods for detecting HGT. Our study showed that alignment-free methods can predict HGT accurately when the host and donor genomes differ at the order level. Among all methods, CVTree with word length 3, $d_2^*$ with word length 3 and Markov order 1, and $d_2^*$ with word length 4 and Markov order 1 outperform the others in terms of their highest F1-score and their robustness under the influence of different factors.
Keywords: CVTree; alignment-free; $d_2^*$; genomic island; horizontal gene transfer; kmer
Year: 2018 PMID: 29713314 PMCID: PMC5911508 DOI: 10.3389/fmicb.2018.00711
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
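The baseline approach named in the abstract (Manhattan or Euclidean distance on tetranucleotide frequencies, computed over a sliding window) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the non-overlapping default step, and the choice to compare each window against the whole-genome frequency vector are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seq, k=4):
    """Relative frequencies of all 4**k DNA k-mers, in lexicographic order.

    k-mers containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in kmers)
    return [counts[w] / total if total else 0.0 for w in kmers]


def manhattan(p, q):
    """Manhattan (L1) distance between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """Euclidean (L2) distance between two frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def window_scores(genome, window=5000, step=None, k=4, dist=manhattan):
    """Score each window by its distance to the whole-genome k-mer profile.

    Assumption for this sketch: windows do not overlap unless `step` is given,
    and a larger distance marks a window as a stronger HGT candidate.
    """
    step = step or window
    genome_profile = kmer_freqs(genome, k)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        window_profile = kmer_freqs(genome[start:start + window], k)
        scores.append((start, dist(window_profile, genome_profile)))
    return scores
```

Windows whose score exceeds a chosen threshold would be reported as putative transferred regions; sweeping that threshold is what produces a precision-recall curve.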
Complete evaluation results for different dissimilarity measures with different word lengths k and Markov orders when needed.
| 0.77 ± 0.01 | 0.99 ± 0.01 | 4.50 | ||
| 0.81 ± 0.01 | 0.95 ± 0.02 | 2.75 | ||
| 0.70 ± 0.02 | 0.71 ± 0.05 | 0.71 ± 0.03 | 1.25 | |
| 0.70 ± 0.04 | 0.65 ± 0.11 | 0.67 ± 0.08 | 4.75 | |
| 0.77 ± 0.01 | 0.99 ± 0.00 | 4.25 | ||
| 0.56 ± 0.01 | 0.96 ± 0.02 | 0.71 ± 0.01 | 2.00 | |
| 0.77 ± 0.01 | 0.99 ± 0.01 | 3.75 | ||
| 0.77 ± 0.01 | 0.96 ± 0.02 | 0.86 ± 0.01 | 2.25 | |
| 0.58 ± 0.01 | 0.93 ± 0.03 | 0.71 ± 0.01 | 2.00 | |
| 0.76 ± 0.01 | 0.98 ± 0.01 | 0.86 ± 0.01 | 3.00 | |
| 0.82 ± 0.01 | 0.90 ± 0.03 | 0.86 ± 0.02 | 2.25 | |
| 0.54 ± 0.03 | 0.78 ± 0.05 | 0.64 ± 0.03 | 1.00 | |
| 0.39 ± 0.12 | 0.82 ± 0.21 | 0.49 ± 0.03 | 0.50 | |
| 0.75 ± 0.01 | 0.99 ± 0.01 | 0.85 ± 0.01 | 2.50 | |
| 0.54 ± 0.10 | 0.83 ± 0.19 | 0.63 ± 0.04 | 0.75 | |
| 0.76 ± 0.06 | 0.79 ± 0.18 | 0.76 ± 0.09 | 1.00 | |
| 0.74 ± 0.02 | 0.79 ± 0.13 | 0.76 ± 0.06 | 1.00 | |
| 0.58 ± 0.02 | 0.80 ± 0.09 | 0.67 ± 0.03 | 1.00 | |
| 0.74 ± 0.03 | 0.89 ± 0.08 | 0.80 ± 0.03 | 1.50 | |
| 0.83 ± 0.02 | 0.87 ± 0.06 | 0.85 ± 0.03 | 1.50 | |
| 0.63 ± 0.02 | 0.67 ± 0.08 | 0.65 ± 0.04 | 1.00 | |
| Ma(3) | 0.75 ± 0.04 | 0.79 ± 0.12 | 0.76 ± 0.07 | 2.50 |
| Ma(4) | 0.80 ± 0.03 | 0.80 ± 0.12 | 0.80 ± 0.07 | 3.00 |
| Ma(5) | 0.79 ± 0.03 | 0.81 ± 0.12 | 0.80 ± 0.07 | 3.25 |
| Eu(3) | 0.76 ± 0.03 | 0.78 ± 0.13 | 0.76 ± 0.07 | 2.50 |
| Eu(4) | 0.80 ± 0.02 | 0.77 ± 0.12 | 0.79 ± 0.07 | 2.75 |
| Eu(5) | 0.79 ± 0.02 | 0.80 ± 0.12 | 0.79 ± 0.07 | 2.75 |
| 0.80 ± 0.04 | 0.76 ± 0.12 | 0.78 ± 0.07 | 5.00 | |
| 0.77 ± 0.04 | 0.82 ± 0.12 | 0.79 ± 0.06 | 4.50 | |
| 0.81 ± 0.03 | 0.81 ± 0.12 | 0.81 ± 0.07 | 4.50 |
Numbers in brackets in the first column indicate the word length k and the Markov order used by each method.
Figure 1. Precision-Recall Curves (PRC) for all the methods. (A) Shows PRC for CVTree with different word lengths. (B) Shows PRC for all the methods. (C) Shows PRC for all methods. (D) Shows PRC for Manhattan, Euclidean and d2.
Performance of different alignment-free HGT detection methods over 20 artificial genomes with different donor genomes.
| 0.027 | 0.18 ± 0.03 | 0.18 ± 0.03 | 0.17 ± 0.03 | 0.17 ± 0.04 | 0.16 ± 0.02 | 0.16 ± 0.02 | 0.17 ± 0.03 | |
| 0.038 | 0.19 ± 0.02 | 0.15 ± 0.02 | 0.19 ± 0.02 | 0.18 ± 0.02 | 0.17 ± 0.02 | 0.16 ± 0.02 | 0.18 ± 0.02 | |
| 0.044 | 0.21 ± 0.02 | 0.17 ± 0.02 | 0.21 ± 0.01 | 0.21 ± 0.02 | 0.17 ± 0.02 | 0.17 ± 0.02 | 0.18 ± 0.02 | |
| 0.090 | 0.23 ± 0.02 | 0.19 ± 0.02 | 0.23 ± 0.02 | 0.22 ± 0.02 | 0.25 ± 0.01 | 0.27 ± 0.01 | 0.27 ± 0.01 | |
| 0.119 | 0.16 ± 0.01 | 0.27 ± 0.02 | 0.14 ± 0.02 | 0.15 ± 0.02 | 0.26 ± 0.02 | 0.26 ± 0.02 | 0.25 ± 0.02 | |
| 0.123 | 0.23 ± 0.02 | 0.19 ± 0.02 | 0.21 ± 0.03 | 0.26 ± 0.01 | 0.26 ± 0.01 | 0.25 ± 0.02 | ||
| 0.124 | 0.27 ± 0.02 | 0.27 ± 0.02 | 0.29 ± 0.02 | 0.30 ± 0.02 | 0.32 ± 0.03 | 0.32 ± 0.02 | ||
| 0.141 | 0.23 ± 0.02 | 0.28 ± 0.02 | 0.19 ± 0.03 | 0.21 ± 0.02 | 0.30 ± 0.02 | 0.30 ± 0.02 | ||
| 0.160 | 0.51 ± 0.02 | 0.50 ± 0.02 | 0.56 ± 0.02 | 0.33 ± 0.03 | 0.30 ± 0.03 | 0.37 ± 0.02 | ||
| 0.223 | 0.39 ± 0.02 | 0.27 ± 0.02 | 0.29 ± 0.01 | 0.33 ± 0.01 | 0.44 ± 0.02 | 0.43 ± 0.02 | ||
| 0.228 | 0.28 ± 0.03 | 0.26 ± 0.03 | 0.21 ± 0.02 | 0.23 ± 0.01 | 0.43 ± 0.01 | |||
| 0.261 | 0.87 ± 0.01 | 0.85 ± 0.01 | 0.55 ± 0.02 | 0.51 ± 0.04 | 0.58 ± 0.01 | |||
| 0.283 | 0.60 ± 0.02 | 0.59 ± 0.01 | 0.63 ± 0.02 | 0.56 ± 0.01 | 0.55 ± 0.01 | 0.57 ± 0.01 | ||
| 0.301 | 0.68 ± 0.02 | 0.62 ± 0.02 | 0.63 ± 0.02 | 0.54 ± 0.01 | 0.52 ± 0.03 | 0.52 ± 0.01 | ||
| 0.308 | 0.82 ± 0.01 | 0.85 ± 0.01 | 0.85 ± 0.01 | 0.63 ± 0.01 | 0.60 ± 0.01 | 0.55 ± 0.01 | ||
| 0.449 | 0.84 ± 0.01 | 0.78 ± 0.01 | 0.84 ± 0.01 | 0.85 ± 0.01 | 0.82 ± 0.02 | 0.84 ± 0.02 | ||
| 0.487 | 0.86 ± 0.01 | 0.83 ± 0.01 | 0.82 ± 0.01 | 0.85 ± 0.03 | 0.85 ± 0.03 | 0.76 ± 0.02 | ||
| 0.550 | 0.89 ± 0.00 | 0.79 ± 0.01 | 0.86 ± 0.01 | 0.81 ± 0.01 | 0.89 ± 0.01 | 0.81 ± 0.01 | ||
| 0.682 | 0.87 ± 0.01 | 0.95 ± 0.01 | 0.94 ± 0.02 | 0.90 ± 0.02 | 0.90 ± 0.03 | 0.88 ± 0.03 | ||
| 0.713 | 0.97 ± 0.00 | 0.94 ± 0.01 | 0.97 ± 0.00 | 0.97 ± 0.00 | 0.97 ± 0.00 |
The first column shows the donor genome of the artificial genome. The top 12 species belong to the same order as E. coli and the bottom 8 species belong to different orders from E. coli. The second column is the Manhattan distance between the donor genome and E. coli K12 based on tetranucleotide frequency. The third to the ninth columns are the optimal F1-scores.
Figure 2. The Precision-Recall Curves (PRC) of different HGT detection methods on artificial genomes using E. coli as the host genome. (A) PRC when using S. sonnei as the donor genome; no method performs well. (B) PRC when using B. abortus as the donor genome; CVT(3), CVT(4), and [Formula: see text] outperform the other methods. (C) PRC when using C. coli as the donor genome; all methods perform reasonably well.
Performance of different methods over artificial genomes by using different window sizes.
| 3 | 0.84 ± 0.01 | 0.82 ± 0.01 | 0.85 ± 0.01 | 0.47 ± 0.02 | 0.43 ± 0.02 | 0.53 ± 0.02 | ||
| 5 | 0.87 ± 0.01 | 0.85 ± 0.01 | 0.55 ± 0.02 | 0.51 ± 0.04 | 0.58 ± 0.01 | |||
| 8 | 0.91 ± 0.01 | 0.94 ± 0.01 | 0.62 ± 0.02 | 0.60 ± 0.02 | 0.64 ± 0.01 | |||
| 3 | 0.62 ± 0.01 | 0.58 ± 0.02 | 0.59 ± 0.02 | 0.51 ± 0.02 | 0.47 ± 0.02 | 0.49 ± 0.02 | ||
| 5 | 0.68 ± 0.02 | 0.62 ± 0.02 | 0.63 ± 0.02 | 0.54 ± 0.01 | 0.52 ± 0.03 | 0.52 ± 0.01 | ||
| 8 | 0.72 ± 0.02 | 0.68 ± 0.04 | 0.66 ± 0.02 | 0.61 ± 0.03 | 0.56 ± 0.03 | 0.56 ± 0.02 | ||
| 3 | 0.77 ± 0.02 | 0.78 ± 0.01 | 0.78 ± 0.01 | 0.55 ± 0.01 | 0.52 ± 0.01 | 0.50 ± 0.01 | ||
| 5 | 0.82 ± 0.01 | 0.85 ± 0.01 | 0.85 ± 0.01 | 0.63 ± 0.01 | 0.60 ± 0.01 | 0.55 ± 0.01 | ||
| 8 | 0.88 ± 0.02 | 0.93 ± 0.01 | 0.91 ± 0.01 | 0.68 ± 0.01 | 0.66 ± 0.02 | 0.61 ± 0.02 | ||
| 3 | 0.79 ± 0.02 | 0.73 ± 0.01 | 0.79 ± 0.01 | 0.79 ± 0.01 | 0.77 ± 0.01 | 0.79 ± 0.01 | ||
| 5 | 0.84 ± 0.01 | 0.78 ± 0.01 | 0.84 ± 0.01 | 0.85 ± 0.01 | 0.82 ± 0.02 | 0.84 ± 0.02 | ||
| 8 | 0.91 ± 0.02 | 0.82 ± 0.01 | 0.89 ± 0.01 | 0.88 ± 0.02 | 0.86 ± 0.01 | 0.87 ± 0.01 | ||
| 3 | 0.74 ± 0.01 | 0.74 ± 0.01 | 0.78 ± 0.02 | 0.78 ± 0.03 | 0.69 ± 0.02 | |||
| 5 | 0.86 ± 0.01 | 0.83 ± 0.01 | 0.82 ± 0.01 | 0.85 ± 0.03 | 0.85 ± 0.03 | 0.76 ± 0.02 | ||
| 8 | 0.92 ± 0.01 | 0.92 ± 0.01 | 0.91 ± 0.02 | 0.86 ± 0.03 | 0.82 ± 0.02 | 0.81 ± 0.02 | ||
| 3 | 0.73 ± 0.02 | 0.79 ± 0.01 | 0.72 ± 0.01 | 0.84 ± 0.01 | 0.83 ± 0.02 | 0.76 ± 0.01 | ||
| 5 | 0.89 ± 0.00 | 0.79 ± 0.01 | 0.86 ± 0.01 | 0.81 ± 0.01 | 0.89 ± 0.01 | 0.81 ± 0.01 | ||
| 8 | 0.84 ± 0.01 | 0.95 ± 0.01 | 0.90 ± 0.01 | 0.90 ± 0.01 | 0.88 ± 0.03 | 0.86 ± 0.01 | ||
| 3 | 0.83 ± 0.01 | 0.91 ± 0.01 | 0.90 ± 0.01 | 0.90 ± 0.03 | 0.91 ± 0.02 | 0.84 ± 0.03 | ||
| 5 | 0.87 ± 0.01 | 0.95 ± 0.01 | 0.94 ± 0.02 | 0.90 ± 0.02 | 0.90 ± 0.03 | 0.88 ± 0.03 | ||
| 8 | 0.93 ± 0.01 | 0.96 ± 0.01 | 0.96 ± 0.01 | 0.89 ± 0.02 | 0.89 ± 0.03 | 0.86 ± 0.02 | ||
| 3 | 0.96 ± 0.01 | 0.90 ± 0.00 | 0.95 ± 0.01 | 0.96 ± 0.00 | ||||
| 5 | 0.97 ± 0.00 | 0.94 ± 0.01 | 0.97 ± 0.00 | 0.97 ± 0.00 | 0.97 ± 0.00 | |||
| 8 | 0.93 ± 0.01 | 0.96 ± 0.01 | 0.94 ± 0.00 | 0.96 ± 0.01 | 0.96 ± 0.01 | 0.96 ± 0.00 |
Values in the second column are the window sizes. All the other columns are the same as in Table .
WS, window size.
Performance of different methods over 118 genomes with known HGT genomic islands in Langille et al. (2008) based on (a) optimal accuracy and (b) optimal F1-score.
| Method | Precision (a) | Recall (a) | Optimal accuracy | Precision (b) | Recall (b) | Optimal F1-score |
| | 0.68 | 0.41 | 0.84 | 0.54 | 0.60 | 0.57 |
| | 0.62 | 0.31 | 0.83 | 0.50 | 0.56 | 0.53 |
| | 0.72 | 0.38 | 0.85 | 0.57 | 0.58 | 0.58 |
| | 0.72 | 0.45 | 0.86 | 0.58 | 0.63 | 0.61 |
| Ma(5) | 0.67 | 0.26 | 0.83 | 0.48 | 0.68 | 0.56 |
| Eu(5) | 0.58 | 0.46 | 0.83 | 0.50 | 0.63 | 0.55 |
| | 0.60 | 0.30 | 0.82 | 0.45 | 0.67 | 0.53 |
The second and third columns show the precision and recall to achieve the optimal accuracy given in the fourth column. The fifth and sixth columns show the precision and recall corresponding to the optimal F1-score.
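As a quick consistency check on the table, the F1-score is the harmonic mean of precision and recall, so the last column can be recomputed from the two columns before it. The sketch below does this for the two rows whose method labels survive, Ma(5) and Eu(5); the helper name `f1_score` and the 0.01 tolerance (to allow for the two-decimal rounding of the published values) are assumptions of this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the Ma(5) and Eu(5) rows above.
reported = {
    "Ma(5)": (0.48, 0.68, 0.56),
    "Eu(5)": (0.50, 0.63, 0.55),
}

for method, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    # Inputs are rounded to two decimals, so allow a small discrepancy.
    agrees = abs(f1 - f1_reported) < 0.01
    print(f"{method}: recomputed F1 = {f1:.3f}, reported = {f1_reported}, agrees = {agrees}")
```

Both rows agree with the reported values to within rounding.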
Figure 3. The Precision-Recall Curves (PRC) of the different methods based on 118 genomes with known HGT genomic islands.
The distances between each gene and E. faecalis V583 genome were calculated and genes were ranked by their distances.
| EF2293 | 607 | 815 | 688 | 605 | 854 | 1,001 | 511 |
| EF2294 | 325 | 1,874 | 222 | 447 | 1,302 | 1,373 | 719 |
| EF2295 | 138 | 855 | 109 | 219 | 1,169 | 1,273 | 520 |
| EF2296 | 379 | 1,613 | 313 | 385 | 1,392 | 1,491 | 850 |
| EF2297 | 618 | 2,638 | 665 | 1,245 | 1,117 | 1,165 | 551 |
| EF2298 | 660 | 1,355 | 702 | 772 | 1,978 | 1,924 | 1,025 |
| EF2299 | 687 | 1,084 | 477 | 607 | 814 | 820 | 384 |
| Median | 607 | 1,355 | 477 | 605 | 1,169 | 1,273 | 551 |
| Mean | 487.7 | 1,462.0 | 453.7 | 611.4 | 1,232.3 | 1,292.4 | 651.4 |
The first to seventh rows show the ranks of EF2293-EF2299 among all E. faecalis V583 genes calculated by different methods. The eighth and ninth rows show the median and mean of the ranks of the seven genes.
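The summary rows can be reproduced directly from the seven per-gene ranks. The sketch below recomputes the median and mean for the first two method columns of the table; the column labels did not survive in this record, so the keys `column_1` and `column_2` are placeholders, and the helper name `summarize_ranks` is hypothetical.

```python
from statistics import mean, median


def summarize_ranks(ranks):
    """Return (median, mean rounded to one decimal) for a list of gene ranks."""
    return median(ranks), round(mean(ranks), 1)


# Ranks of EF2293 to EF2299 as listed in the first two method columns.
ranks_by_column = {
    "column_1": [607, 325, 138, 379, 618, 660, 687],
    "column_2": [815, 1874, 855, 1613, 2638, 1355, 1084],
}

for column, ranks in ranks_by_column.items():
    med, avg = summarize_ranks(ranks)
    print(f"{column}: median = {med}, mean = {avg}")
```

For these two columns the recomputed values match the Median and Mean rows of the table (607 and 487.7; 1,355 and 1,462.0).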