Zhencheng Fang, Hongwei Zhou.
Abstract
Plasmids are key elements of horizontal gene transfer in microbial communities. Recently, many experimental and computational methods have been developed to obtain the plasmidomes of microbial communities. Distinguishing transmissible plasmid sequences, which are derived from conjugative or at least mobilizable plasmids, from non-transmissible plasmid sequences in the plasmidome is essential for understanding the diversity of plasmids and how they regulate the microbial community. Unfortunately, because DNA sequences in the plasmidome are highly fragmented, effective identification methods are lacking. In this work, we used information entropy from information theory to assess the randomness of synonymous codon usage across 4424 plasmid genomes. The results showed that, for all amino acids, the choice of synonymous codon in conjugative and mobilizable plasmids is more random than in non-transmissible plasmids, indicating that transmissible plasmids have sequence signatures different from those of non-transmissible plasmids. Inspired by this phenomenon, we developed a novel algorithm named PlasTrans. PlasTrans takes the triplet code sequences and base sequences of plasmid DNA fragments as input and uses a convolutional neural network, a deep learning technique, to extract more complex signatures of the plasmid sequences and identify conjugative and mobilizable DNA fragments. Tests showed that PlasTrans could achieve an AUC of 84-91%, even when the fragments contained only hundreds of base pairs. To the best of our knowledge, this is the first quantitative analysis of the difference in sequence signatures between transmissible and non-transmissible plasmids, and PlasTrans is the first tool to perform transferability annotation for DNA fragments in the plasmidome. We expect that PlasTrans will be a useful tool for researchers who analyse the properties of novel plasmids in the microbial community and horizontal gene transfer, especially the spread of resistance genes and virulence factors associated with plasmids. PlasTrans is freely available via https://github.com/zhenchengfang/PlasTrans.
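The entropy-based measure of synonymous codon usage described in the abstract can be illustrated with a minimal sketch. It assumes the measure is the Shannon entropy of the synonymous codon frequencies for one amino acid, divided by the maximum possible entropy and expressed as a percentage; the paper's exact definition may differ. The names `SYNONYMOUS`, `split_codons` and `normalized_codon_entropy` are hypothetical and are not taken from the PlasTrans code.

```python
import math
from collections import Counter

# Synonymous codons for lysine (K) and asparagine (N) in the standard genetic code.
SYNONYMOUS = {
    "K": ["AAA", "AAG"],
    "N": ["AAT", "AAC"],
}

def split_codons(cds):
    """Split an in-frame coding sequence into triplets, dropping a trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def normalized_codon_entropy(codons, synonymous):
    """Shannon entropy of synonymous codon usage as a percentage of its maximum.

    Assumption: normalizing by log2(number of synonymous codons) is one plausible
    way to obtain a percentage; it is not confirmed by the source.
    """
    counts = Counter(c for c in codons if c in synonymous)
    total = sum(counts.values())
    if total == 0 or len(synonymous) < 2:
        return 0.0
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return 100.0 * entropy / math.log2(len(synonymous))

# Evenly mixed lysine codons give the maximum value.
balanced = split_codons("AAAAAGAAAAAG")
print(normalized_codon_entropy(balanced, SYNONYMOUS["K"]))  # 100.0
```

Under this definition, a value near 100% means that the synonymous codons of an amino acid are used almost uniformly, while a value near 0% means that one codon dominates.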
Keywords: deep learning; information theory; plasmid transmissibility; plasmidome; sequence signatures
Year: 2020 PMID: 33074084 PMCID: PMC7725325 DOI: 10.1099/mgen.0.000459
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Structure of PlasTrans. PlasTrans takes the triplet code sequences and base sequences of a plasmid DNA fragment as input and provides a likelihood score that reflects whether the fragment is derived from a transmissible plasmid.
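As an illustration of what "triplet code sequences and base sequences" could look like as network input, the sketch below one-hot encodes a DNA fragment both per base and per codon. This is an assumption made for illustration: the source does not state which encoding or reading frame PlasTrans uses, and the names `one_hot_bases` and `one_hot_codons` are hypothetical.

```python
BASES = "ACGT"
# All 64 triplets in a fixed order: AAA, AAC, AAG, AAT, ACA, ...
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

def one_hot_bases(seq):
    """One row of length 4 per base; ambiguous bases such as N become all-zero rows."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def one_hot_codons(seq, frame=0):
    """One row of length 64 per complete triplet, read from the given frame offset."""
    seq = seq.upper()
    rows = []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        rows.append([1 if codon == c else 0 for c in CODONS])
    return rows

fragment = "AAAACGT"
print(len(one_hot_bases(fragment)), len(one_hot_codons(fragment)))  # 7 2
```

Each matrix could then be fed to a separate convolutional branch, with one position per row and one channel per column.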
The average IEA of 19 amino acids in transmissible and non-transmissible plasmid genomes, and the AUC obtained when a given IEA is used to classify transmissible and non-transmissible plasmid genomes. The P values of the two-sided Wilcoxon rank sum test showed that the IEA differences were significant.
| Amino acid | Average IEA (%), transmissible plasmid | Average IEA (%), non-transmissible plasmid | AUC (%) | P value |
|---|---|---|---|---|
| K | 88.58 | 84.56 | 57.05 | 1.24e−13 |
| N | 94.57 | 85.55 | 64.45 | 3.85e−52 |
| T | 93.73 | 88.16 | 62.21 | 9.35e−38 |
| R | 90.33 | 84.40 | 64.93 | 1.42e−55 |
| S | 94.17 | 90.07 | 64.56 | 6.70e−53 |
| I | 87.92 | 86.39 | 53.40 | 3.52e−4 |
| Q | 82.81 | 81.08 | 55.65 | 2.93e−9 |
| H | 96.07 | 88.13 | 64.72 | 4.67e−54 |
| P | 91.99 | 87.79 | 62.32 | 2.08e−38 |
| L | 84.10 | 83.15 | 53.15 | 9.18e−4 |
| E | 95.10 | 85.92 | 63.23 | 5.03e−44 |
| D | 94.03 | 84.27 | 65.49 | 1.19e−59 |
| A | 95.41 | 89.27 | 66.09 | 3.48e−64 |
| G | 94.41 | 92.48 | 58.24 | 4.40e−18 |
| V | 93.18 | 88.52 | 63.28 | 2.48e−44 |
| Y | 95.18 | 85.01 | 64.05 | 2.15e−49 |
| C | 91.83 | 88.30 | 55.94 | 4.19e−10 |
| F | 88.00 | 78.89 | 62.38 | 9.87e−39 |
| Stop amino acid | 85.35 | 81.95 | 55.82 | 9.30e−10 |
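The AUC column reports how well a single IEA value separates the two groups. A minimal sketch of one way to compute such an AUC is shown here, using the pairwise-ranking form that is equivalent to the Wilcoxon rank sum (Mann-Whitney) U statistic divided by the number of pairs. The function name `auc_from_scores` and the example values are hypothetical, and the values are not data from the paper.

```python
def auc_from_scores(positive, negative):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This equals U / (n_pos * n_neg),
    where U is the Mann-Whitney statistic of the positive group.
    """
    if not positive or not negative:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in positive:
        for n in negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive) * len(negative))

# Made-up per-genome entropy values (%), for illustration only.
transmissible = [95.0, 90.0, 88.0]
non_transmissible = [89.0, 84.0, 80.0]
auc = auc_from_scores(transmissible, non_transmissible)  # 8 of 9 pairs are ordered correctly
```

An AUC of 50% corresponds to no separation between the groups, which is why values of 53-66% in the table indicate a real but modest signal from any single amino acid.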
Fig. 2. PlasTrans performance. PlasTrans was evaluated based on the benchmark test set of artificial plasmid contigs in different groups separately using the criteria of recall, precision, F1 score and AUC.
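For reference, recall, precision and F1 score can be computed from likelihood scores as in the sketch below. It assumes that label 1 means "transmissible" and that a score of at least 0.5 is predicted as transmissible; both conventions, and the name `classification_metrics`, are illustrative assumptions and not taken from the PlasTrans implementation.

```python
def classification_metrics(labels, scores, threshold=0.5):
    """Return (precision, recall, f1) for binary labels and likelihood scores."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1          # transmissible fragment correctly identified
        elif predicted and label == 0:
            fp += 1          # non-transmissible fragment flagged as transmissible
        elif not predicted and label == 1:
            fn += 1          # transmissible fragment missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels and scores, for illustration only.
precision, recall, f1 = classification_metrics([1, 1, 1, 0, 0], [0.9, 0.7, 0.3, 0.6, 0.1])
```

Unlike AUC, these three criteria depend on the chosen threshold, so they are reported for a fixed decision rule.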
Fig. 3. PlasTrans performance with different thresholds. Under a given threshold, t, a sequence with a score falling into the interval of |score−0.5|