David Pellow, Itzik Mizrahi, Ron Shamir.
Abstract
Many bacteria contain plasmids, but distinguishing contigs that originate from a plasmid from those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained using a fraction of the known plasmids, and can be difficult to use in practice. We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers trained on the most current set of known plasmid sequences for different sequence lengths. We tested PlasClass sequence classification on held-out data and simulations, as well as publicly available bacterial isolates and plasmidome samples and plasmids assembled from metagenomic samples. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, allowing it to achieve higher F1 scores in classifying sequences from a wide range of datasets. PlasClass also uses significantly less time and memory. PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available under the MIT license at https://github.com/Shamir-Lab/PlasClass.
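The idea of keeping separate classifiers for different sequence lengths can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the length bins, the placeholder scorers, the 0.5 threshold, and the nearest-length selection rule are hypothetical and are not taken from PlasClass, whose actual models and selection logic are not described in this abstract.

```python
from typing import Callable, Dict

# A scorer maps a sequence to a plasmid score in [0, 1].
# These are hypothetical stand-ins for classifiers trained on
# fragments of different lengths; they are NOT the PlasClass models.
Scorer = Callable[[str], float]


def pick_scorer(seq_len: int, scorers: Dict[int, Scorer]) -> Scorer:
    """Return the scorer whose training length is closest to seq_len."""
    best_len = min(scorers, key=lambda trained_len: abs(trained_len - seq_len))
    return scorers[best_len]


def classify(seq: str, scorers: Dict[int, Scorer], threshold: float = 0.5) -> str:
    """Label a sequence using the scorer matched to its length."""
    score = pick_scorer(len(seq), scorers)(seq)
    return "plasmid" if score > threshold else "chromosome"


# Toy scorers with constant outputs, keyed by an assumed training length (bp).
toy_scorers: Dict[int, Scorer] = {
    1_000: lambda seq: 0.2,
    10_000: lambda seq: 0.8,
}

short_label = classify("ACGT" * 300, toy_scorers)   # 1,200 bp, uses the 1k scorer
long_label = classify("ACGT" * 2500, toy_scorers)   # 10,000 bp, uses the 10k scorer
```

The point of the sketch is only the dispatch step: a short contig is scored by a model that was trained on similarly short fragments rather than by a single model trained on full-length references.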
Year: 2020 PMID: 32243433 PMCID: PMC7159247 DOI: 10.1371/journal.pcbi.1007781
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Performance on held-out data.
| Length (bp) | # fragments per class | PlasClass Precision | PlasClass Recall | PlasClass F1 | PlasFlow Precision | PlasFlow Recall | PlasFlow F1 |
|---|---|---|---|---|---|---|---|
| 100k | 2979 | 96.9 | 85.4 | 90.8 | 95.6 | 88.4 | 91.9 |
| 10k | 56583 | 88.7 | 86.4 | 87.6 | 83.1 | 87.7 | 85.3 |
| 1k | 607656 | 75.1 | 74.6 | 74.8 | 59.7 | 79.1 | 68.1 |
Performance of PlasClass and PlasFlow on fixed-length sequence fragments sampled from the held-out references.
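The F1 columns in these tables are the harmonic mean of precision and recall. As a small check, the sketch below recomputes F1 from the rounded precision and recall of the 100k PlasClass row; the function name `f1_score` is ours, and because the table inputs are already rounded, recomputed values can differ from the reported ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 100k bp fragments, PlasClass: precision 96.9, recall 85.4 (percent).
f1_100k = f1_score(96.9, 85.4)
print(round(f1_100k, 1))  # 90.8
```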
Performance on bacterial isolates.
| Contig length (bp) | # of contigs | PlasClass Precision | PlasClass Recall | PlasClass F1 | PlasFlow Precision | PlasFlow Recall | PlasFlow F1 |
|---|---|---|---|---|---|---|---|
| All | 36172 | 43.65 | 77.58 | 55.87 | 31.16 | 87.77 | 46.00 |
| >500 | 11659 | 53.15 | 91.30 | 67.18 | 37.68 | 89.23 | 52.99 |
| >1000 | 7414 | 59.95 | 91.82 | 72.54 | 47.54 | 90.04 | 62.23 |
| >5000 | 3999 | 61.84 | 92.12 | 74.00 | 50.05 | 92.31 | 64.91 |
Performance on bacterial isolates from [6], as a function of the minimum contig length.
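Reporting performance as a function of the minimum contig length amounts to discarding contigs at or below a threshold and recomputing the metrics on the rest. The sketch below shows that calculation on made-up toy data; the tuple layout, the function name `metrics_above`, and the example contigs are assumptions for illustration and do not reproduce the authors' evaluation code or data.

```python
from typing import Iterable, Tuple

# (length in bp, truly plasmid?, predicted plasmid?) -- hypothetical layout
Contig = Tuple[int, bool, bool]


def metrics_above(contigs: Iterable[Contig], min_len: int = 0) -> Tuple[float, float, float]:
    """Precision, recall and F1 over contigs strictly longer than min_len."""
    tp = fp = fn = 0
    for length, is_plasmid, predicted_plasmid in contigs:
        if length <= min_len:
            continue  # filtered out, as in the ">500", ">1000" rows
        if predicted_plasmid and is_plasmid:
            tp += 1
        elif predicted_plasmid:
            fp += 1
        elif is_plasmid:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data only: one short false positive, one missed plasmid contig.
toy_contigs = [
    (300, False, True),
    (800, True, True),
    (1200, True, False),
    (6000, True, True),
    (7000, False, False),
]

for threshold in (0, 500, 1000, 5000):
    p, r, f1 = metrics_above(toy_contigs, threshold)
    print(f">{threshold}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In the toy data the only false positive is the shortest contig, so precision rises once it is filtered out, which mirrors the trend of higher precision at larger minimum lengths seen in the table.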
Performance on simulated metagenomes.
| Dataset | # chromosomes | # plasmids | # unique | # contigs | PlasClass F1 | PlasFlow F1 |
|---|---|---|---|---|---|---|
| Sim1 | 34 | 82 | 56 | 34092 | 15.79 | 13.49 |
| Sim2 | 198 | 333 | 219 | 388669 | 12.08 | 8.79 |
Summary of the simulated metagenome datasets and comparison of F1 scores. # unique is the number of distinct plasmids, ignoring multiple copies.
Simulated metagenome performance by length.
| Dataset | Contig length (bp) | # of contigs | PlasClass Precision | PlasClass Recall | PlasClass F1 | PlasFlow Precision | PlasFlow Recall | PlasFlow F1 |
|---|---|---|---|---|---|---|---|---|
| Sim1 | All | 34092 | 8.94 | 67.40 | 15.79 | 7.30 | 87.75 | 13.49 |
| Sim1 | >500 | 17023 | 11.22 | 78.55 | 19.64 | 8.20 | 85.05 | 14.95 |
| Sim1 | >1000 | 11696 | 15.67 | 80.96 | 26.26 | 10.92 | 85.00 | 19.36 |
| Sim1 | >5000 | 4032 | 36.11 | 86.80 | 51.00 | 28.09 | 90.80 | 42.91 |
| Sim2 | All | 388669 | 6.64 | 66.98 | 12.08 | 4.64 | 84.31 | 8.79 |
| Sim2 | >500 | 106814 | 13.76 | 76.00 | 23.29 | 8.42 | 84.23 | 15.32 |
| Sim2 | >1000 | 45597 | 22.42 | 79.20 | 34.95 | 14.01 | 86.52 | 24.11 |
| Sim2 | >5000 | 5642 | 46.50 | 81.18 | 59.13 | 38.48 | 88.49 | 53.63 |
Performance on simulated metagenomes as a function of the minimum contig length.
Performance on a plasmidome sample.
| Precision | Recall | F1 score | |
|---|---|---|---|
| PlasClass | 64.25 | ||
| PlasFlow | 23.72 | 37.23 |
Performance of PlasClass and PlasFlow on the plasmidome sample from [8].
Fig 1. Classifying plasmids assembled from metagenomic samples.
Agreement of PlasClass and PlasFlow classifications with the plasmids generated by Recycler.
Resource usage.
| Dataset | PlasFlow Runtime | PlasFlow RAM | PlasFlow Disk | PlasClass Runtime | PlasClass RAM | PlasClass (8 processes) Runtime | PlasClass (8 processes) RAM |
|---|---|---|---|---|---|---|---|
| Isolates | 12.8 | 47.8 | 21.4 | 36.3 | 17.2 | 6.8 | 17.2 |
| Sim1 | 7.1 | 28.3 | 12.1 | 16.2 | 12.0 | 3.0 | 12.0 |
| Sim2 | 89.3 | 291.3 | 137.5 | 54.8 | 17.3 | 17.1 | 17.3 |
| Plasmidome | 7.9 | 28.8 | 12.2 | 4.2 | 12.2 | 5.2 | 17.3 |
Runtime (wall clock time, in minutes) and memory usage (in GB) of PlasClass and PlasFlow.