Paul D Donovan, Gabriel Gonzalez, Desmond G Higgins, Geraldine Butler, Kimihito Ito.
Abstract
Metagenomics uses nucleic acid sequencing to characterize species diversity in different niches such as environmental biomes or the human microbiome. Most studies have used 16S rRNA amplicon sequencing to identify bacteria. However, the decreasing cost of sequencing has resulted in a gradual shift away from amplicon analyses and towards shotgun metagenomic sequencing. Shotgun metagenomic data can be used to identify a wide range of species, but have rarely been applied to fungal identification. Here, we develop a sequence classification pipeline, FindFungi, and use it to identify fungal sequences in public metagenome datasets. We focus primarily on animal metagenomes, especially those from pig and mouse microbiomes. We identified fungi in 39 of 70 datasets comprising 71 fungal species. At least 11 pathogenic species with zoonotic potential were identified, including Candida tropicalis. We identified Pseudogymnoascus species from 13 Antarctic soil samples initially analyzed for the presence of bacteria capable of degrading diesel oil. We also show that Candida tropicalis and Candida loboi are likely the same species. In addition, we identify several examples where contaminating DNA was erroneously included in fungal genome assemblies.
Year: 2018 PMID: 29444186 PMCID: PMC5812651 DOI: 10.1371/journal.pone.0192898
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Species used to generate three simulated read datasets.
| Species | Accession numbers | Genome (bp) | Exome (bp) | Standard (reads) | Spiked (reads) | RNA-seq (reads) |
|---|---|---|---|---|---|---|
| | NC_000964.3 | 4215606 | 3697728 | 421560 | 421560 | 348870 |
| | NC_006347.1/NC_006297.1 | 5310990 | 4787184 | 531090 | 531090 | 455540 |
| | NC_014638.1 | 2214656 | 1853190 | 221460 | 221460 | 176810 |
| | NC_006814.3 | 1993560 | 1741788 | 199350 | 199350 | 165210 |
| | NC_003997.3 | 5227293 | 4234317 | 522720 | 522720 | 397230 |
| | NC_005956.1 | 1931047 | 1386678 | 193100 | 193100 | 131170 |
| | NC_008508.1/NC_008509.1 | 3931782 | 3023346 | 393170 | 393170 | 285500 |
| | NC_007795.1 | 2821361 | 2352093 | 282104 | 282110 | 221610 |
| | NC_003131.1/NC_003132.1/NC_003134.1/NC_003143.1 | 4829855 | 3852405 | 482980 | 482980 | 365300 |
| | calb_Chr_1 (assembly 19) | 3188548 | 2014897 | 317216 | 317172 | 194026 |
| | NC_002516.2 | 6264404 | - | - | 626440 | - |
| | NC_012560.1 | 5365318 | - | - | 536530 | - |
| | KV453841.1 | 3117240 | - | - | 309088 | - |
| | NC_003424.3 | 5579133 | - | - | 557880 | - |
1. Only one chromosome was used from each of the fungal genomes.
*Denotes fungal species.
§Denotes species not included in the test database.
Table 2. Comparison of classification tools using simulated datasets from Table 1.
| Dataset | Tool | TP | FP | TN | FN | Sensitivity | Specificity | Time (sec) |
|---|---|---|---|---|---|---|---|---|
| Standard | BLAST | 3501029 | 4 | 31509237 | 63779 | 0.982108714 | | 1144.06 |
| Standard | DIAMOND | 2625609 | 5598 | 23675230 | 939199 | 0.736535881 | 0.999763606 | 631.34 |
| Standard | Kraken 31 | 3554377 | 31 | 32082661 | 10431 | 0.997073896 | 0.999999034 | 135.4 |
| Standard | Kraken 16 | 3563611 | 41 | 32082651 | 1197 | | 0.999998722 | 219.47 |
| Standard | Kaiju | 2942976 | 2332 | 32080360 | 621832 | 0.825563677 | 0.999927313 | |
| RNA-seq | BLAST | 2706255 | 0 | 24356295 | 35011 | 0.987228164 | | 813.14 |
| RNA-seq | DIAMOND | 2537754 | 120 | 22840746 | 203512 | 0.92575985 | 0.999994746 | 497.66 |
| RNA-seq | Kraken 31 | 2734158 | 0 | 24671394 | 7108 | 0.997407037 | | 93.38 |
| RNA-seq | Kraken 16 | 2741261 | 2 | 24671392 | 5 | | 0.999999919 | 243.92 |
| RNA-seq | Kaiju | 2723973 | 333 | 24671061 | 17293 | 0.993691601 | 0.999986503 | |
| Spiked | BLAST | 3501363 | 2646 | 31536017 | 63477 | 0.982287271 | 0.999914998 | 1445.13 |
| Spiked | DIAMOND | 2626340 | 170647 | 25167565 | 938500 | 0.729845366 | 0.993133657 | 831.33 |
| Spiked | Kraken 31 | 3554057 | 2582 | 52379078 | 10783 | 0.997034142 | | |
| Spiked | Kraken 16 | 3563615 | 1288299 | 51093361 | 1225 | | 0.975408061 | 424.59 |
| Spiked | Kaiju | 2944335 | 66520 | 52315140 | 620505 | 0.819370262 | 0.99871138 | 280.58 |
1. For Kraken 31, the test database was divided into 32 individual databases.
2. Number of reads classified as TP: true positives, FP: false positives, TN: true negatives, FN: false negatives.
3. Sensitivity: TP/(TP + FN); specificity: TN/(TN + FP).
4. CPU time in seconds. The best sensitivity, specificity, and time for each dataset are highlighted in bold.
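The sensitivity and specificity definitions in the footnotes can be checked directly against the read counts. The following is a minimal sketch, not code from the FindFungi pipeline; the function names are illustrative, and the counts are taken from the Standard / Kraken 31 row.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly positive reads that were classified as positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of truly negative reads that were classified as negative."""
    return tn / (tn + fp)


# Counts from the Standard dataset, Kraken 31 row of the comparison table.
tp, fp, tn, fn = 3554377, 31, 32082661, 10431

print(f"sensitivity = {sensitivity(tp, fn):.9f}")
print(f"specificity = {specificity(tn, fp):.9f}")
```

For this row the printed values should agree with the tabulated 0.997073896 and 0.999999034 up to rounding. The same two functions can be applied to any other row's counts.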
Fig 1. Sequence reads assigned to the fungal pathogen Puccinia triticina are derived from a transposable element.
Maximum likelihood tree comparing the Copia transposable element from a number of plant genomes and the fungus P. triticina (shaded). Bootstrap values out of 100 are shown at nodes. Species, chromosome accession, and nucleotide coordinates are displayed. The tree was generated in SeaView using PhyML with the generalized time-reversible (GTR) evolution model using Gblocks and 100 bootstraps.
Fig 2. Distinguishing true and false positives using genomic read distribution.
(A) Reads classified as C. tropicalis mapped against the C. tropicalis MYA-3404 genome. The reads (6,656) were gathered by combining all reads assigned to C. tropicalis from the datasets ERR675617 and ERR670622. (B) Reads classified as T. islandicus mapped against the T. islandicus genome. The reads (7,000) are from the dataset ERR675670. All reads in each analysis were concatenated into a single pseudo-chromosome (orange chromosome with the shortest radius) with 20 ambiguous nucleotides (N) separating each read. The chromosomes in both A and B are colored with a red-to-blue color spectrum. The T. islandicus label names are abbreviated (e.g. 12.1 displayed instead of CVMT010000012.1). BLAST hits are shown as green links connecting a read with a genomic sequence. The plots were generated using Circos [35].
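The pseudo-chromosome construction described in the caption (reads concatenated with 20 ambiguous nucleotides between them) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: the function name and the returned coordinate list are hypothetical, and the coordinates are included only to make it easy to map a position on the pseudo-chromosome back to its read.

```python
SPACER = "N" * 20  # 20 ambiguous nucleotides between reads, as stated in the caption


def build_pseudo_chromosome(reads):
    """Join reads into one sequence and record each read's (start, end) span.

    Spans are 0-based and half-open, so sequence[start:end] returns the read.
    """
    spans = []
    position = 0
    for read in reads:
        spans.append((position, position + len(read)))
        position += len(read) + len(SPACER)
    return SPACER.join(reads), spans


sequence, spans = build_pseudo_chromosome(["ACGT", "GG", "TTT"])
for start, end in spans:
    print(start, end, sequence[start:end])
```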
Fig 3. FindFungi v0.23 pipeline overview.
Reads are downloaded in FASTQ format. Low-quality reads are removed with Skewer [37]. The remaining reads are converted to FASTA format and analyzed by 32 instances of Kraken, each using a different database [26]. The 32 Kraken predictions for each fungal read are consolidated, and a consensus prediction is assigned. Reads not predicted as fungal are removed. The best hit for each read is mapped to a pseudo-assembly of the relevant genome using BLAST [21]. Species for which BLAST reports hits on more than 30% of pseudo-chromosomes are retained. Pearson's coefficient of skewness is calculated to identify non-randomly distributed reads. Species with a skewness score between -0.2 and 0.2 (minimal skew) are retained. Fungal predictions, statistics and summary plots are written to a PDF file, and fungal prediction statistics are also written to a CSV file.
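The two retention filters at the end of the pipeline can be sketched as below. This is a hedged illustration, not the FindFungi implementation, and it rests on these assumptions:

- The caption does not say which Pearson coefficient is used. The sketch assumes the second (median-based) form, 3 × (mean − median) / standard deviation.
- It assumes the statistic is computed over hit positions along the pseudo-assembly.
- It treats the -0.2 and 0.2 bounds as inclusive.
- All function and parameter names are hypothetical.

```python
from statistics import mean, median, pstdev


def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("skewness is undefined when all values are identical")
    return 3 * (mean(values) - median(values)) / sd


def retain_species(hit_positions, chromosomes_hit, chromosomes_total,
                   min_fraction=0.30, max_abs_skew=0.2):
    """Apply the >30% pseudo-chromosome filter, then the skewness filter."""
    if chromosomes_hit / chromosomes_total <= min_fraction:
        return False
    if len(set(hit_positions)) < 2:
        # Assumption: hits piled on a single position count as non-random.
        return False
    return abs(pearson_median_skewness(hit_positions)) <= max_abs_skew


evenly_spread = list(range(0, 1000, 10))   # hits spread along the sequence
clustered = [1] * 90 + [1000] * 10          # most hits at one locus

print(retain_species(evenly_spread, 5, 10))
print(retain_species(clustered, 5, 10))
```

Under these assumptions, evenly spread hits give a skewness near zero and pass, while hits concentrated at one locus give a large skewness and are rejected. The transposable-element and contamination cases in Figs 1 and 2 show the kind of clustered pattern this filter targets.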
Table 3. Fungal predictions from metagenomics datasets by FindFungi v0.23.
| Source | Total dataset reads | Predicted fungal reads | Fungal predictions (no. of reads) | |
|---|---|---|---|---|
| ERR1135318 | 86432970 | 380 | ||
| ERR1135427 | 23597054 | 491 | ||
| ERR1135453 | 59108986 | 1863 | ||
| ERR1135454 | 30677741 | 3335 | ||
| ERR1135455 | 57177310 | 1521 | ||
| ERR1135750 | 437278 | 46 | ||
| ERR1223845 | 62054282 | 25105 | ||
| ERR248260 | 134577030 | 35352 | ||
| ERR248262 | 141428756 | 116 | ||
| ERR571345 | 5074590 | 122 | ||
| ERR675346 | 731620 | 6156 | ||
| ERR675408 | 907429 | 2339 | ||
| ERR675411 | 809560 | 2986 | ||
| ERR675415 | 857596 | 88 | ||
| ERR675422 | 280130 | 60 | ||
| ERR675423 | 360841 | 95 | ||
| ERR675429 | 511455 | 95 | ||
| ERR675603 | 35832380 | 57 | ||
| ERR675608 | 30598678 | 404 | ||
| ERR675609 | 29666898 | 13451 | ||
| ERR675612 | 3883030 | 2314 | ||
| ERR675617 | 27007988 | 11589 | ||
| ERR675618 | 27288536 | 341 | ||
| ERR675622 | 23395904 | 9753 | ||
| ERR675624 | 16893482 | 1314 | ||
| ERR675626 | 21805514 | 910 | ||
| mgm4721951.3 | 1726909 | 157390 | ||
| mgm4721952.3 | 2867433 | 411 | ||
| mgm4721953.3 | 2119288 | 229853 | ||
| mgm4721954.3 | 3215171 | 412 | ||
| mgm4721955.3 | 1105951 | 1558 | ||
| mgm4721956.3 | 1097260 | 263 | ||
| mgm4721957.3 | 2059400 | 27267 | ||
| mgm4721958.3 | 1294113 | 1364 | ||
| mgm4721959.3 | 358379 | 190 | ||
| mgm4721960.3 | 1067649 | 5899 | ||
| mgm4721961.3 | 1686048 | 28885 | ||
| mgm4721962.3 | 2063872 | 6260 | ||
| mgm4721963.3 | 2287098 | 633283 | ||
| - | 844345609 | 1213318 | | |
1. ERR1135227, ERR1135237, ERR1135245, ERR1135256, ERR1135268, ERR1135269, ERR1135291, ERR1135346, ERR1135368, ERR1135372, ERR1135406, ERR1135418, ERR1135429, ERR1135449, ERR1135459, ERR1135749, ERR1223846, ERR675430, ERR675519, ERR675529, ERR675568, ERR675616, ERR675632, ERR675653, ERR675654, ERR675670, ERR675674, ERR675677, ERR675680, ERR675682, ERR675683 had no fungal reads.
Fig 4. Candida loboi and Candida tropicalis are isolates of the same species.
Maximum likelihood tree of a concatenated five-protein alignment from species from the Candida Gene Order Browser (CGOB; [46]) and C. loboi. Five genes (ERG1, MEF1, CEF3, DEG1, GCD14) that are conserved in all CGOB species were chosen at random. All C. loboi orthologs were identified with best BLAST matches using C. tropicalis gene homologs. Protein sequences were aligned using Muscle (v3.8.31, [47]) and concatenated. The tree was generated in SeaView [48] using PhyML with the LG evolution model using Gblocks [49] and 100 bootstraps (shown at nodes). Species abbreviations are displayed at branch leaves.
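The concatenation step in the caption (per-gene protein alignments joined into one alignment per species) can be illustrated with a short sketch. It is not the authors' workflow, which used Muscle and SeaView. The function name, the dictionary layout, and the toy sequences are hypothetical, and the sketch assumes every gene alignment contains every species, as the caption states for the five chosen genes.

```python
def concatenate_alignments(gene_alignments):
    """Concatenate per-gene alignments into one sequence per species.

    gene_alignments: list of dicts, one per gene, mapping species name to its
    aligned (gapped) protein sequence. Every dict must cover the same species,
    and sequences within a gene must have equal length.
    """
    species = sorted(gene_alignments[0])
    for alignment in gene_alignments:
        if sorted(alignment) != species:
            raise ValueError("every gene alignment must contain the same species")
        if len({len(seq) for seq in alignment.values()}) != 1:
            raise ValueError("sequences within one alignment must be equal length")
    return {sp: "".join(alignment[sp] for alignment in gene_alignments)
            for sp in species}


# Toy aligned fragments; these are placeholders, not real ERG1 or MEF1 sequences.
gene_a = {"C. tropicalis": "MK-LV", "C. loboi": "MKALV"}
gene_b = {"C. tropicalis": "GTW", "C. loboi": "GT-"}

for name, seq in concatenate_alignments([gene_a, gene_b]).items():
    print(name, seq)
```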