Paul Greenfield, Nai Tran-Dinh, David Midgley.
Abstract
INTRODUCTION: Whole-metagenome sequencing can be a rich source of information about the structure and function of entire metagenomic communities, but getting accurate and reliable results from these datasets can be challenging. Analysis of these datasets is founded on the mapping of sequencing reads onto known genomic regions from known organisms, but short reads will often map equally well to multiple regions, and to multiple reference organisms. Assembling metagenomic datasets prior to mapping can generate much longer and more precisely mappable sequences, but the presence of closely related organisms and highly conserved regions makes metagenomic assembly challenging, and some regions of particular interest can assemble poorly. One solution to these problems is to use specialised tools, such as Kelpie, that can accurately extract and assemble full-length sequences for defined genomic regions from whole-metagenome datasets.
Keywords: Amplicons; Community structure; In-silico PCR; Metagenomes; Targeted assembly
Year: 2019 PMID: 30723610 PMCID: PMC6359901 DOI: 10.7717/peerj.6174
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1: Pseudocode for Kelpie extension phase.
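As a rough illustration of the kind of k-mer-guided extension loop this caption refers to, the following sketch extends a seed sequence one base at a time. It is not Kelpie's algorithm: the function names, the `min_depth` filter, and the simplified two-rule decision (take the only viable next k-mer, otherwise pick in proportion to depth) are assumptions made for the example.

```python
import random
from collections import Counter


def build_kmer_counts(reads, k):
    """Count every k-mer occurring in the reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_read(seed, kmer_counts, k, target_len, min_depth=1, rng=None):
    """Extend `seed` base by base until it is `target_len` long.

    Returns the extended sequence, or None if no viable next k-mer
    exists (the read is abandoned). Assumes len(seed) >= k - 1.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    seq = seed
    while len(seq) < target_len:
        prefix = seq[-(k - 1):]
        # Depth of each possible next k-mer, keeping only those seen often enough.
        viable = {}
        for base in "ACGT":
            depth = kmer_counts.get(prefix + base, 0)
            if depth >= min_depth:
                viable[base] = depth
        if not viable:
            return None  # nothing to extend with: abandon this read
        if len(viable) == 1:
            base = next(iter(viable))  # single k-mer choice
        else:
            # Simplified tie-break: choose in proportion to k-mer depth.
            bases = sorted(viable)
            base = rng.choices(bases, weights=[viable[b] for b in bases])[0]
        seq += base
    return seq


# Toy example: three overlapping reads from one short made-up sequence.
genome = "ATGCGTACCTGAAGT"
reads = [genome[0:8], genome[4:12], genome[7:15]]
counts = build_kmer_counts(reads, k=4)
extended = extend_read("ATGC", counts, k=4, target_len=len(genome))
```

In this toy case every extension step has exactly one viable k-mer, so the seed is extended back to the full made-up sequence; asking for a longer target than the reads support returns `None`.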
Read extension decision statistics for three CSM datasets.
| | W1 | W2 | W3 | W1 % | W2 % | W3 % |
|---|---|---|---|---|---|---|
| read extension checks | 20517124 | 3562698 | 8402118 | | | |
| single choice at | 19700062 | 3518077 | 8222559 | 96.0% | 98.7% | 97.9% |
| single choice at | 66318 | 15118 | 77026 | 0.3% | 0.4% | 0.9% |
| single choice at | 72763 | 12650 | 11615 | 0.4% | 0.4% | 0.1% |
| single choice at | 17655 | 715 | 10485 | 0.1% | 0.0% | 0.1% |
| single choice at | 34950 | 1488 | 4043 | 0.2% | 0.0% | 0.0% |
| single choice at | 19292 | 11136 | 21155 | 0.1% | 0.3% | 0.3% |
| single choice at | 18425 | 1128 | 7107 | 0.1% | 0.0% | 0.1% |
| single choice at | 117494 | 0 | 7540 | 0.6% | 0.0% | 0.1% |
| single choice at | 97699 | 1273 | 3204 | 0.5% | 0.0% | 0.0% |
| single kMer choice | 20144658 | 3561585 | 8364734 | 98.2% | 100.0% | 99.6% |
| looked downstream | 238307 | 962 | 28396 | 1.2% | 0.0% | 0.3% |
| single good downstream | 134009 | 46 | 7436 | 0.7% | 0.0% | 0.1% |
| chose in proportion by depth | 104293 | 909 | 19487 | 0.5% | 0.0% | 0.2% |
| chose longest downstream | 5 | 7 | 1473 | 0.0% | 0.0% | 0.0% |
| # of starting reads | 19750 | 15876 | 23324 | | | |
| # of reads abandoned | 145 | 28 | 79 | 0.7% | 0.2% | 0.3% |
| # of fully extended reads | 19605 | 15848 | 23245 | 99.3% | 99.8% | 99.7% |
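
The percentage columns in this table appear to be each count divided by the relevant total for that sample (extension checks for the decision rows, starting reads for the last two rows). A minimal sketch that reproduces a few of them from the counts above; the helper name `share` and the dictionary layout are illustrative only.

```python
def share(count, total):
    """Percentage of `total` represented by `count`, to one decimal place."""
    return round(100.0 * count / total, 1)


# Counts copied from the table above.
extension_checks = {"W1": 20517124, "W2": 3562698, "W3": 8402118}
single_kmer_choice = {"W1": 20144658, "W2": 3561585, "W3": 8364734}
starting_reads = {"W1": 19750, "W2": 15876, "W3": 23324}
fully_extended = {"W1": 19605, "W2": 15848, "W3": 23245}

single_choice_pct = {
    s: share(single_kmer_choice[s], extension_checks[s]) for s in extension_checks
}
fully_extended_pct = {
    s: share(fully_extended[s], starting_reads[s]) for s in starting_reads
}

for sample in ("W1", "W2", "W3"):
    print(sample, single_choice_pct[sample], fully_extended_pct[sample])
```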
Top 25 most abundant organisms found in EBI project ERP008951.
The first part of the table comes from the community profile generated by the EBI Metagenomics Portal, and the second part is from an OTU table produced from Kelpie-generated data. The highlighted cells were only resolved to a taxonomic level above Species.
Top 25 EBI Species

| Family | Genus | Species | Sum |
|---|---|---|---|
| Unclassified_f | Unclassified_g | Unclassified_sp | 96675 |
| Bacteroidaceae | Bacteroides | Bacteroides_sp | 242238 |
| Lachnospiraceae | Lachnospiraceae_g | Lachnospiraceae_sp | 100484 |
| Prevotellaceae | Prevotella | Prevotella copri | 86982 |
| Ruminococcaceae | Faecalibacterium | Faecalibacterium prausnitzii | 70676 |
| Ruminococcaceae | Ruminococcaceae_g | Ruminococcaceae_sp | 68736 |
| Clostridiales_f | Clostridiales_g | Clostridiales_sp | 67346 |
| Lachnospiraceae | Lachnospira | Lachnospira_sp | 38338 |
| Enterobacteriaceae | Enterobacteriaceae_g | Enterobacteriaceae_sp | 30687 |
| Bacteroidaceae | Bacteroides | Bacteroides uniformis | 24942 |
| Lachnospiraceae | Blautia | Blautia_sp | 24770 |
| Sutterellaceae | Sutterella | Sutterella_sp | 24151 |
| Porphyromonadaceae | Parabacteroides | Parabacteroides_sp | 23702 |
| Lachnospiraceae | Coprococcus | Coprococcus_sp | 21015 |
| Ruminococcaceae | Ruminococcus | Ruminococcus_sp | 20449 |
| Prevotellaceae | Prevotella | Prevotella_sp | 15063 |
| Lachnospiraceae | Roseburia | Roseburia_sp | 13965 |
| Porphyromonadaceae | Parabacteroides | Parabacteroides distasonis | 13588 |
| Rikenellaceae | Rikenellaceae_g | Rikenellaceae_sp | 13216 |
| Ruminococcaceae | Oscillospira | Oscillospira_sp | 12402 |
| Veillonellaceae | Dialister | Dialister_sp | 11469 |
| Selenomonadaceae | Megamonas | Megamonas_sp | 9059 |
| Enterobacteriaceae | Klebsiella | Klebsiella_sp | 9045 |
| Lachnospiraceae | Dorea | Dorea_sp | 8362 |
| Bacteroidaceae | Bacteroides | Bacteroides ovatus | 7631 |
Figure 2: Order-level comparison between taxonomic profiles for EBI project ERP008951.
(A) Bar chart showing the most abundant Orders found by the EBI pipeline and in the Kelpie-based OTU table. (B) Scatter plot for the same data. The data were extracted from the spreadsheet in Table S1, and the plots were generated by STAMP.
Extract from CSM OTU table (amplicons and extended reads).
The first 25 of 228 rows of the Coal Seam Metagenome OTU table found in Table S2. The ‘amp’ columns are amplicon counts; the ‘ext’ columns are counts of Kelpie extended reads. Counts in bold indicate that the OTU consensus sequence was not completely covered by WGS reads.
| OTU | Size | Species | W1 amp | W1 ext | W2 amp | W2 ext | W3 amp | W3 ext |
|---|---|---|---|---|---|---|---|---|
| 1 | 43,603 | Desulfuromonas acetexigens (T) ( | 27333 | 13574 | 0 | 1554 | 1010 | |
| 2 | 24,970 | Thermodesulfovibrio aggregans (T) TGE-P1 ( | 0 | 17120 | 7816 | 0 | ||
| 3 | 10,514 | Treponema zuelzerae (T) type strain: DSM 1903; 2 ( | 0 | 1171 | 305 | 5956 | 3069 | |
| 4 | 10,163 | Methanobacterium subterraneum (T) A8p, DSM 11074 ( | 0 | 0 | 7736 | 2393 | ||
| 5 | 7,081 | Cytophaga fermentans (T) ATCC 19072 ( | 0 | 5845 | 1220 | 0 | ||
| 7 | 6,514 | Methanosaeta harundinacea (T) 8Ac ( | 1032 | 192 | 0 | 3332 | 1942 | |
| 6 | 6,264 | Parabacteroides distasonis (T) JCM 5825 ( | 1270 | 271 | 0 | 3116 | 1598 | |
| 8 | 5,520 | Thermacetogenium phaeum (T) PB ( | 0 | 0 | 3285 | 2216 | ||
| 10 | 4,837 | candidate division OP1 clone OPB14 ( | 0 | 4057 | 771 | 0 | ||
| 12 | 4,611 | Lysinibacillus sp. LAM612 ( | 0 | 0 | 533 | 4068 | ||
| 9 | 4,258 | Methanosarcina siciliae type strain: DSM3028 ( | 1238 | 2733 | 0 | 54 | 222 | |
| 13 | 3,847 | Methanocalculus pumilus (T) MHT-1 ( | 3312 | 476 | 0 | 0 | ||
| 11 | 3,652 | Desulfotomaculum acetoxidans (T) DSM 771 ( | 0 | 2463 | 1177 | 0 | ||
| 14 | 3,390 | Syntrophaceticus schinkii (T) Sp3 ( | 0 | 2871 | 506 | 0 | ||
| 15 | 3,383 | Methanobacterium aarhusense (T) H2-LR ( | 0 | 3104 | 271 | 0 | ||
| 17 | 3,012 | Methanothermobacter thermoflexus (T) IDZ, VKM B-1963, DSM 7268 ( | 0 | 2685 | 326 | 0 | ||
| 16 | 2,920 | Sulfurospirillum alkalitolerans HTRB-L1 ( | 2340 | 508 | 0 | 0 | ||
| 21 | 2,114 | Methanobacterium alcaliphilum (T) NBRC 105226 ( | 0 | 1161 | 100 | 586 | 265 | |
| 18 | 2,099 | Clostridium hungatei (T) AD; ATCC 700212 ( | 0 | 0 | 1124 | 965 | ||
| 20 | 2,067 | Natronincola peptidivorans (T) Z-7031 ( | 0 | 0 | 1293 | 754 | ||
| 19 | 1,955 | Pontibacter sp. JC215 A10 ( | 4 | 0 | 0 | 931 | 1018 | |
| 23 | 1,734 | Porphyromonas pogonae strain MI 10-1288 ( | 1059 | 128 | 0 | 389 | 129 | |
| 25 | 1,557 | Acetobacterium malicum (T) DSM 4132 ( | 929 | 304 | 0 | 153 | 155 | |
| 22 | 1,515 | Desulfovibrio oxamicus (T) DSM 1925 ( | 10 | 0 | 860 | 624 | ||
| 24 | 1,513 | 0 | 955 | 556 | 0 |
Figure 3: Agreement between amplicon and Kelpie-based OTUs for CSM datasets.
(A) Percentages ordered by cumulative read count for the four ‘AE’ OTU tables in Table S6 (samples combined, and processed separately). In the combined table, the first OTU without supporting counts from both amplicons and Kelpie-extended reads comes after 98.8% of the amplicon reads have been assigned to OTUs (83rd OTU in reverse cumulative size order), and represents 0.03% of the amplicon reads. (B) PCA plot showing the similarity between the amplicon and Kelpie-based profiles.
Details from the identity comparisons between the amplicons and Kelpie-generated OTU centroid sequences for the W2 CSM dataset.
The centroid sequences for OTUs 20 and 29 are slightly different, although within the 97% similarity threshold. Closer examination of the sequence ‘clouds’ that were clustered together to form these OTUs showed that these apparent differences arose from the choice of different centroid sequences, rather than from the Kelpie and amplicon sequences actually being distinct.
| OTU | Size | Kelpie species | Id% | Amplicon species |
|---|---|---|---|---|
| 1 | 7816 | Thermodesulfovibrio aggregans (T) TGE-P1 ( | 100 | Thermodesulfovibrio aggregans (T) TGE-P1 ( |
| 2 | 1220 | Cytophaga fermentans (T) ATCC 19072 ( | 100 | Cytophaga fermentans (T) ATCC 19072 ( |
| 3 | 1169 | Desulfotomaculum acetoxidans (T) DSM 771 ( | 100 | Desulfotomaculum acetoxidans (T) DSM 771 ( |
| 4 | 847 | Moorella humiferrea (T) 64 FGQ ( | 100 | Moorella humiferrea (T) 64 FGQ ( |
| 5 | 771 | candidate division OP1 clone OPB14 ( | 100 | candidate division OP1 clone OPB14 ( |
| 6 | 556 | – | 100 | – |
| 7 | 506 | Syntrophaceticus schinkii (T) Sp3 ( | 100 | Syntrophaceticus schinkii (T) Sp3 ( |
| 8 | 419 | Thermodesulfovibrio aggregans (T) TGE-P1 ( | 100 | Thermodesulfovibrio aggregans (T) TGE-P1 ( |
| 9 | 408 | Ignavibacterium album (T) Mat9-16 ( | 100 | Ignavibacterium album (T) Mat9-16 ( |
| 10 | 326 | Methanothermobacter thermoflexus (T) IDZ, VKM B-1963, DSM 7268 ( | 100 | Methanothermobacter thermoflexus (T) IDZ, VKM B-1963, DSM 7268 ( |
| 11 | 305 | Treponema zuelzerae (T) type strain: DSM 1903; 2 ( | 100 | Treponema zuelzerae (T) type strain: DSM 1903; 2 ( |
| 12 | 271 | Methanobacterium aarhusense (T) H2-LR ( | 100 | Methanobacterium aarhusense (T) H2-LR ( |
| 13 | 211 | – | 100 | – |
| 20 | 108 | Dethiobacter alkaliphilus (T) AHT 1 ( | 98 | Dethiobacter alkaliphilus (T) AHT 1 ( |
| 14 | 100 | Methanobacterium alcaliphilum (T) NBRC 105226 ( | 100 | Methanobacterium alcaliphilum (T) NBRC 105226 ( |
| 15 | 100 | – | 100 | – |
| 16 | 98 | Thermodesulfovibrio yellowstonii (T) YP87 ( | 100 | Thermodesulfovibrio yellowstonii (T) YP87 ( |
| 17 | 85 | Pelotomaculum propionicicum (T) MGP ( | 100 | Pelotomaculum propionicicum (T) MGP ( |
| 18 | 82 | Sunxiuqinia faeciviva (T) JAM-BA0302 ( | 100 | Sunxiuqinia faeciviva (T) JAM-BA0302 ( |
| 19 | 79 | Thermanaerothrix daxensis strain GNS-1 ( | 100 | Thermanaerothrix daxensis strain GNS-1 ( |
| 21 | 65 | Smithella propionica (T) LYP ( | 100 | Smithella propionica (T) LYP ( |
| 22 | 60 | Caldicoprobacter oshimai (T) JW/HY-331 ( | 100 | Caldicoprobacter oshimai (T) JW/HY-331 ( |
| 23 | 48 | Bellilinea caldifistulae (T) GOMI-1 ( | 100 | Bellilinea caldifistulae (T) GOMI-1 ( |
| 24 | 37 | Leptolinea tardivitalis (T) YMTK-2 ( | 100 | Leptolinea tardivitalis (T) YMTK-2 ( |
| 25 | 36 | uncultured bacterium KF-JG30-18 ( | 100 | uncultured bacterium KF-JG30-18 ( |
| 26 | 35 | Desulfotomaculum kuznetsovii strain 17 ( | 100 | Desulfotomaculum kuznetsovii strain 17 ( |
| 27 | 29 | Dethiobacter alkaliphilus (T) AHT 1 ( | 100 | Dethiobacter alkaliphilus (T) AHT 1 ( |
| 28 | 21 | Acidobacteria bacterium P105 ( | 100 | Acidobacteria bacterium P105 ( |
| 29 | 21 | Olegusella massiliensis strain KHD7 ( | 99.6 | Olegusella massiliensis strain KHD7 ( |
| 30 | 19 | Syntrophorhabdus aromaticivorans (T) UI ( | 100 | Syntrophorhabdus aromaticivorans (T) UI ( |
Summary of identity comparisons between centroid OTU sequences for the 3 CSM datasets.
The small number of non-identical species appears to be caused by the clustering algorithm choosing different consensus sequences from within a cluster of strain-level variants. In total, 3 OTUs found by Kelpie do not appear in the amplicon data.
| | W1 | W1 % | W2 | W2 % | W3 | W3 % |
|---|---|---|---|---|---|---|
| #OTUs | 39 | | 30 | | 57 | |
| 100% identical | 36 | 92% | 28 | 93% | 47 | 82% |
| same species (97%+) | 1 | 3% | 2 | 7% | 4 | 7% |
| same genus (95%+) | 1 | 3% | 0 | 0% | 4 | 7% |
| not in amplicons | 1 | 3% | 0 | 0% | 2 | 4% |
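
The row labels in this summary imply identity thresholds of 100%, 97% and 95%. A minimal sketch of that binning, applied to the W2 identity values listed in the preceding table (28 centroids at 100%, plus OTUs 20 and 29 at 98% and 99.6%); the function name and the label for values below 95% are assumptions, and the "not in amplicons" row cannot be derived from identity values alone.

```python
from collections import Counter


def identity_class(identity_pct):
    """Bin a centroid identity percentage using the table's thresholds."""
    if identity_pct >= 100.0:
        return "100% identical"
    if identity_pct >= 97.0:
        return "same species (97%+)"
    if identity_pct >= 95.0:
        return "same genus (95%+)"
    return "below 95%"  # label assumed for illustration


# W2 identities from the preceding table: 28 at 100%, OTU 20 at 98%, OTU 29 at 99.6%.
w2_identities = [100.0] * 28 + [98.0, 99.6]

tally = Counter(identity_class(x) for x in w2_identities)
percentages = {
    label: round(100 * n / len(w2_identities)) for label, n in tally.items()
}
print(tally, percentages)
```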
Figure 4: Numbers of OTUs found in the top 98% (A) and 99% (B) of the community profile for each of the three samples.
Numbers of OTUs are ranked by cumulative read count and derived from the three ‘AESS-W’ OTU tables in Table S6. The OTU counts have been adjusted by removing amplicon OTUs that have incomplete WGS read coverage.
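One way to read "the top 98%" of a profile is as the smallest set of the largest OTUs whose reads together reach 98% of all reads. The sketch below counts OTUs under that assumed definition; the function name is hypothetical and the counts are made-up illustrative values, not data from Table S6.

```python
def otus_in_top_percent(counts, percent):
    """Number of largest OTUs needed to reach `percent` of all reads.

    `percent` is an integer (e.g. 98); integer arithmetic avoids
    floating-point edge cases at the threshold.
    """
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for n, count in enumerate(ordered, start=1):
        running += count
        if running * 100 >= percent * total:
            return n
    return len(ordered)


# Made-up read counts per OTU (illustrative only).
example_counts = [500, 300, 150, 35, 10, 5]
top98 = otus_in_top_percent(example_counts, 98)
top99 = otus_in_top_percent(example_counts, 99)
print(top98, top99)
```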
Comparisons for the CAMI Low Complexity dataset.
Comparison between the organisms named in the CAMI ‘gold’ profile, the corresponding classified rRNA V4 regions extracted from the CAMI-provided assembled contigs, and the classified Kelpie ‘amplicons’. Any species in the CAMI profile whose 16S rRNA V4 region could not be found in the provided contigs has been removed from this table.
| CAMI gold profile: Species | CAMI gold profile: Abnd. | V4 region from contigs: Species/strain | V4 region from contigs: Cov% | Kelpie profile: Species/strain | Kelpie profile: Reads | Kelpie profile: Abnd. |
|---|---|---|---|---|---|---|
| Schwartzia succinivorans | 28.2% | Schwartzia succinivorans strain S1-1 ( | 100 | Schwartzia succinivorans strain S1-1 ( | 615 | 26.3% |
| Hydrotalea sandarakina | 19.8% | Hydrotalea sandarakina strain AF-51 ( | 100 | Hydrotalea sandarakina strain AF-51 ( | 759 | 32.5% |
| Tetrasphaera duodecadis | 14.9% | Tetrasphaera duodecadis strain IAM 14868 ( | 100 | Tetrasphaera duodecadis strain IAM 14868 ( | 255 | 10.9% |
| Bacillales sp | 9.2% | Exiguobacterium acetylicum strain DSM 20416 ( | 100 | Exiguobacterium acetylicum strain DSM 20416 ( | 169 | 7.2% |
| Janthinobacterium sp. | 7.8% | Massilia namucuonensis strain 333-1-0411 ( | 100 | Massilia namucuonensis strain 333-1-0411 ( | 132 | 5.6% |
| Pseudomonas aeruginosa | 6.0% | Pseudomonas aeruginosa strain DSM 50071 ( | 100 | Pseudomonas aeruginosa strain DSM 50071 ( | 108 | 4.6% |
| Paracoccus denitrificans | 3.7% | Paracoccus denitrificans strain 381 ( | 100 | Paracoccus denitrificans strain 381 ( | 74 | 3.2% |
| Defluviimonas denitrificans | 3.0% | Defluviimonas denitrificans strain D9-3 ( | 100 | Defluviimonas denitrificans strain D9-3 ( | 48 | 2.1% |
| Desulfatibacillum alkenivorans | 1.9% | Desulfatibacillum alkenivorans strain PF2803 ( | 100 | Desulfatibacillum alkenivorans strain PF2803 ( | 42 | 1.8% |
| Actinomycetales sp. | 1.1% | Williamsia phyllosphaerae strain C7 ( | 100 | Williamsia phyllosphaerae strain C7 ( | 8 | 0.3% |
| Flavisolibacter ginsengisoli | 1.8% | Flavisolibacter ginsengisoli strain Gsoil 643 ( | 100 | Flavisolibacter ginsengisoli strain Gsoil 643 ( | 83 | 3.6% |
| Tepidibacter formicigenes | 0.7% | Tepidibacter formicigenes strain DV1184 ( | 100 | Tepidibacter formicigenes strain DV1184 ( | 11 | 0.5% |
| Albidovulum xiamenense | 0.4% | Albidovulum xiamenense strain YBY-7 ( | 100 | Albidovulum xiamenense strain YBY-7 ( | 1 | 0.0% |
| Xylella fastidiosa | 0.4% | Xylella fastidiosa strain PCE-FF ( | 100 | Xylella fastidiosa strain PCE-FF ( | 17 | 0.7% |
| Lampropedia hyalina | 0.4% | Lampropedia hyalina strain IAM 14890 ( | 97 | | | |
| Lysobacter oryzae | 0.3% | Lysobacter oryzae strain YC6269 ( | 100 | Lysobacter oryzae strain YC6269 ( | 16 | 0.7% |
| Anaerobranca californiensis | 0.2% | Anaerobranca zavarzinii strain JW/VK-KS5Y ( | 96 | | | |
| Nonlabens dokdonensis | 0.1% | Nonlabens dokdonensis ( | 50 | | | |
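
The Kelpie abundance column appears consistent with each species' read count divided by the sum of the listed Kelpie read counts. A short sketch under that assumption, using the read counts from the table above (species names shortened to the binomial; the variable names are illustrative):

```python
# Kelpie read counts copied from the table above.
kelpie_reads = {
    "Schwartzia succinivorans": 615,
    "Hydrotalea sandarakina": 759,
    "Tetrasphaera duodecadis": 255,
    "Exiguobacterium acetylicum": 169,
    "Massilia namucuonensis": 132,
    "Pseudomonas aeruginosa": 108,
    "Paracoccus denitrificans": 74,
    "Defluviimonas denitrificans": 48,
    "Desulfatibacillum alkenivorans": 42,
    "Williamsia phyllosphaerae": 8,
    "Flavisolibacter ginsengisoli": 83,
    "Tepidibacter formicigenes": 11,
    "Albidovulum xiamenense": 1,
    "Xylella fastidiosa": 17,
    "Lysobacter oryzae": 16,
}

total_reads = sum(kelpie_reads.values())

# Relative abundance as a percentage of all listed Kelpie reads.
abundance_pct = {
    species: round(100.0 * reads / total_reads, 1)
    for species, reads in kelpie_reads.items()
}

for species, pct in abundance_pct.items():
    print(f"{species}: {pct}%")
```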
Recall and precision statistics for the CAMI Low and Medium Complexity datasets.
| | CAMI low complexity: Present in contigs | CAMI low complexity: Present & fully covered by WGS reads | CAMI low complexity: Top 99% by abundance | CAMI medium complexity: Present in contigs | CAMI medium complexity: Present & fully covered by WGS reads | CAMI medium complexity: Top 99% by abundance |
|---|---|---|---|---|---|---|
| #organisms | 18 | 15 | 14 | 71 | 51 | 57 |
| both (TP) | 15 | 15 | 14 | 51 | 51 | 49 |
| added (FP) | 0 | 0 | 0 | 0 | 0 | 0 |
| missing (FN) | 3 | 0 | 0 | 20 | 0 | 8 |
| Precision | 100% | 100% | 100% | 100% | 100% | 100% |
| Recall | 83% | 100% | 100% | 72% | 100% | 86% |
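
The precision and recall rows follow from the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal sketch that recomputes three columns of the table; the function name and the zero-denominator convention are choices made for the example.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) as fractions; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# (TP, FP, FN) taken from the table above.
columns = {
    "low, present in contigs": (15, 0, 3),
    "medium, present in contigs": (51, 0, 20),
    "medium, top 99% by abundance": (49, 0, 8),
}

results = {}
for name, (tp, fp, fn) in columns.items():
    p, r = precision_recall(tp, fp, fn)
    results[name] = (round(100 * p), round(100 * r))
    print(f"{name}: precision {results[name][0]}%, recall {results[name][1]}%")
```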