Ryota Gomi, Kelly L Wyres, Kathryn E Holt.
Abstract
Plasmids play an important role in bacterial evolution and mediate horizontal transfer of genes, including virulence and antimicrobial resistance genes. Although short-read sequencing technologies have enabled large-scale bacterial genomics, the resulting draft genome assemblies are often fragmented into hundreds of discrete contigs. Several tools and approaches have been developed to identify plasmid sequences in such assemblies, but they require a trade-off between sensitivity and specificity. Here we propose using the Kraken classifier, together with a custom Kraken database comprising known chromosomal and plasmid sequences of the Klebsiella pneumoniae species complex (KpSC), to identify plasmid-derived contigs in draft assemblies. We assessed performance using Illumina-based draft genome assemblies for 82 KpSC isolates, for which complete genomes were available to supply ground truth. When benchmarked against five other classifiers (Centrifuge, RFPlasmid, mlplasmids, PlaScope and Platon), Kraken showed balanced performance in terms of overall sensitivity and specificity (90.8 % and 99.4 %, respectively, for contig count; 96.5 % and >99.9 %, respectively, for cumulative contig length), the highest accuracy (96.8 % vs 91.8-96.6 % for contig count; 99.8 % vs 99.0-99.7 % for cumulative contig length) and the highest F1-score (94.5 % vs 84.5-94.1 % for contig count; 98.0 % vs 88.9-96.7 % for cumulative contig length). Kraken also achieved consistent performance across our genome collection. Furthermore, we demonstrate that expanding the Kraken database with additional known chromosomal and plasmid sequences can further improve classification performance. Although we have focused here on the KpSC, this methodology could easily be applied to other species with a sufficient number of completed genomes.
Keywords: Klebsiella pneumoniae; antimicrobial resistance gene; plasmid detection; whole-genome sequencing
Year: 2021 PMID: 33826492 PMCID: PMC8208688 DOI: 10.1099/mgen.0.000550
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
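As a rough illustration of the classification step described in the abstract, the sketch below turns Kraken's per-sequence output into a chromosome/plasmid label for each contig. It relies only on the standard tab-separated Kraken output columns (classified flag, sequence ID, taxonomy ID, sequence length, k-mer LCA mappings). The taxonomy IDs in `LABEL_BY_TAXID` and the function name `label_contigs` are hypothetical: how the custom KpSC database encodes chromosomal and plasmid sequences is not stated in this record.

```python
import csv
import io

# Hypothetical taxonomy IDs. How the custom database distinguishes
# chromosomal from plasmid sequences is an assumption of this sketch.
LABEL_BY_TAXID = {"1000001": "chromosome", "1000002": "plasmid"}


def label_contigs(kraken_output, label_by_taxid=LABEL_BY_TAXID):
    """Map each contig ID to (label, length) from Kraken per-sequence output.

    Expected columns: C/U flag, sequence ID, taxonomy ID, length in bp,
    k-mer LCA mappings (the last column is ignored here).
    """
    labels = {}
    for row in csv.reader(kraken_output, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed lines
        status, contig_id, taxid, length = row[:4]
        if status == "U":
            label = "unclassified"
        else:
            # Taxonomy IDs outside the mapping are treated as unclassified.
            label = label_by_taxid.get(taxid, "unclassified")
        labels[contig_id] = (label, int(length))
    return labels


# Toy input in the Kraken output layout (values are made up).
example = io.StringIO(
    "C\tcontig_1\t1000001\t250000\t1000001:249970\n"
    "C\tcontig_2\t1000002\t8000\t1000002:7970\n"
    "U\tcontig_3\t0\t600\t0:570\n"
)
labels = label_contigs(example)
```

Keeping the contig length alongside the label makes it straightforward to summarize results both by contig count and by cumulative contig length, the two views reported in this study.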
Characteristics of plasmid prediction tools
| Tool | Classification features | Classification approach/method | Models/databases | Reference |
|---|---|---|---|---|
| cBar | Pentamer frequencies | Sequential minimal optimization | Taxon-independent model | Zhou |
| PlasFlow | | Neural network | Taxon-independent model | Krawczyk |
| Platon | Chromosomal and plasmid marker protein sequences and other features (e.g. contig length) | Calculation of a mean replicon distribution score and higher-level contig characterizations | Taxon-independent database | Schwengers |
| plasmidSPAdes | Read coverage of contigs and structural features of the assembly graph | Assemble plasmids from whole genome sequencing data | – | Antipov |
| RFPlasmid | Pentamer frequencies and chromosomal and plasmid marker genes | Random forest | Taxon-specific models for 17 taxa and a taxon-independent model available | van der Graaf-van Bloois |
| mlplasmids | Pentamer frequencies | Support-vector machine | Three taxon-specific models available | Arredondo-Alonso |
| PlaScope | Exact matches to a database of known sequences | Centrifuge | Two taxon-specific databases available (novel databases can be created by the user) | Royer |
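Several tools in the table above (cBar, RFPlasmid, mlplasmids) use pentamer frequencies as classification features. The sketch below shows one simple way such a feature vector could be computed for a contig: count every overlapping 5-mer over the alphabet A, C, G, T and normalize by the total. It is an illustration only; the function name `pentamer_frequencies` is hypothetical, and the actual tools may differ in details such as strand handling and normalization.

```python
from collections import Counter
from itertools import product

# All 4**5 = 1024 pentamers over A, C, G, T, in lexicographic order.
PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]


def pentamer_frequencies(sequence):
    """Return a 1024-element list of relative pentamer frequencies.

    Windows containing characters other than A, C, G or T (for example N)
    are ignored. A sequence with no valid window yields all zeros.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 5] for i in range(len(seq) - 4))
    total = sum(counts[p] for p in PENTAMERS)
    if total == 0:
        return [0.0] * len(PENTAMERS)
    return [counts[p] / total for p in PENTAMERS]


features = pentamer_frequencies("ACGTACGTACGTTTGACCA")
```

A fixed-length vector like this can be passed to any standard classifier, which is why the same feature type appears alongside different learning methods in the table.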
Performance metrics for Kraken, Centrifuge, RFPlasmid, mlplasmids, PlaScope and Platon on contigs from 82 KpSC genomes
| | Kraken | Centrifuge | RFPlasmid | mlplasmids | PlaScope | Platon |
|---|---|---|---|---|---|---|
| Sensitivity (true positive rate, %) | | | | | | |
| Contig length | | 94.565 | 94.198 | 95.987 | 87.199 | 80.215 |
| Contig count | 90.796 | 90.485 | 82.898 | | 81.157 | 74.192 |
| Specificity (true negative rate, %) | | | | | | |
| Contig length | 99.973 | 99.941 | 99.859 | 99.727 | 99.958 | |
| Contig count | 99.378 | 99.189 | 98.458 | 91.804 | 99.243 | |
| Precision (positive predictive value, %) | | | | | | |
| Contig length | 99.472 | 98.851 | 97.289 | 94.965 | 99.120 | |
| Contig count | | 97.980 | 95.899 | 83.568 | 97.899 | 98.189 |
| Negative predictive value (%) | | | | | | |
| Contig length | | 99.709 | 99.689 | 99.784 | 99.317 | 98.948 |
| Contig count | 96.128 | 95.995 | 92.976 | | 92.372 | 89.853 |
| Accuracy (%) | | | | | | |
| Contig length | | 99.667 | 99.570 | 99.536 | 99.308 | 98.977 |
| Contig count | | 96.550 | 93.742 | 93.025 | 93.761 | 91.762 |
| F1-score (%) | | | | | | |
| Contig length | | 96.660 | 95.719 | 95.474 | 92.778 | 88.878 |
| Contig count | | 94.083 | 88.926 | 89.282 | 88.745 | 84.520 |
Total contig sizes and contig counts assigned to each category (true positive, true negative, false positive and false negative) are provided in Table S3. Bold font indicates the highest value in each category.
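The metrics in the table above follow the standard confusion-matrix definitions, computed either per contig (contig count) or weighted by contig size (contig length). The sketch below illustrates both views under stated assumptions: plasmid is the positive class, any contig not predicted as plasmid counts as a negative prediction, and the names `confusion_totals` and `metrics` are hypothetical. The source does not say how unclassified contigs were handled, so that choice is an assumption here.

```python
def confusion_totals(contigs, weight_by_length=False):
    """Sum TP/TN/FP/FN over (truth, predicted, length) tuples.

    With weight_by_length=True each contig contributes its length in bp
    instead of 1, giving the cumulative-contig-length view.
    """
    totals = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, predicted, length in contigs:
        weight = length if weight_by_length else 1
        if truth == "plasmid":
            key = "TP" if predicted == "plasmid" else "FN"
        else:
            # Assumption: anything not called plasmid is a negative call.
            key = "FP" if predicted == "plasmid" else "TN"
        totals[key] += weight
    return totals


def metrics(totals):
    """Return the six metrics as percentages (NaN if undefined)."""
    tp, tn, fp, fn = (totals[k] for k in ("TP", "TN", "FP", "FN"))

    def ratio(num, den):
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    values = {
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
    return {name: 100 * value for name, value in values.items()}


# Toy data: (truth, predicted, contig length in bp). Values are made up.
contigs = [
    ("plasmid", "plasmid", 5000),
    ("plasmid", "chromosome", 1000),
    ("chromosome", "chromosome", 200000),
    ("chromosome", "plasmid", 2000),
]
by_count = metrics(confusion_totals(contigs))
by_length = metrics(confusion_totals(contigs, weight_by_length=True))
```

Because large chromosomal contigs dominate the length-weighted totals, the length-based figures tend to be higher than the count-based ones, which is consistent with the pattern in the table.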
Fig. 1. Performance metrics for Kraken, Centrifuge, RFPlasmid, mlplasmids, PlaScope and Platon for each genome in terms of contig length (a) and contig count (b). NPV, negative predictive value.
Fig. 2. Classification results for 87 contigs that were classified into different categories after expanding the Kraken database. The number of contigs in each category is shown next to the diagram. Note that contigs whose category did not change are not shown in this figure (n=5218 contigs). The Sankey diagram was created using the riverplot R package (v0.6).
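The counts behind a diagram like Fig. 2 can be obtained by comparing each contig's category before and after a database change and tallying only the contigs that moved. The sketch below is a minimal illustration; the function name `category_changes` and the category labels in the toy data are hypothetical, since this record does not list the categories used in the figure.

```python
from collections import Counter


def category_changes(before, after):
    """Tally contigs whose category changed between two classification runs.

    `before` and `after` map contig ID -> category. Returns a Counter keyed
    by (old_category, new_category) plus the number of unchanged contigs.
    """
    changed = Counter()
    unchanged = 0
    for contig_id, old_category in before.items():
        new_category = after[contig_id]
        if old_category == new_category:
            unchanged += 1
        else:
            changed[(old_category, new_category)] += 1
    return changed, unchanged


# Toy data with hypothetical category labels.
before = {"c1": "chromosome", "c2": "unclassified", "c3": "plasmid", "c4": "unclassified"}
after = {"c1": "chromosome", "c2": "plasmid", "c3": "plasmid", "c4": "plasmid"}
changed, unchanged = category_changes(before, after)
```

Each key of `changed` corresponds to one flow between categories, and its value gives the width of that flow.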