Gleb Goussarov1,2, Jürgen Claesen1,3, Mohamed Mysara1, Ilse Cleenwerck2, Natalie Leys1, Peter Vandamme2, Rob Van Houdt4.
Abstract
BACKGROUND: Although the total number of microbial taxa on Earth is under debate, it is clear that only a small fraction of these has been cultivated and validly named. Evidently, the inability to culture most bacteria outside of very specific conditions severely limits their characterization and further studies. In the last decade, a major part of the solution to this problem has been the use of metagenome sequencing, whereby the DNA of an entire microbial community is sequenced, followed by the in silico reconstruction of genomes of its novel component species. The large discrepancy between the number of sequenced type strain genomes (around 12,000) and total microbial diversity (10^6-10^12 species) directs these efforts to de novo assembly and binning. Unfortunately, these steps are error-prone and as such, the results have to be intensely scrutinized to avoid publishing incomplete and low-quality genomes.
Keywords: Alignment-free; Binning; DNA mock metagenome; Metagenomics; Quality control; Software
Year: 2022 PMID: 35248155 PMCID: PMC8898458 DOI: 10.1186/s40793-022-00403-7
Source DB: PubMed Journal: Environ Microbiome ISSN: 2524-6372
Datasets used in this study
| Dataset | Name (a) | Complexity (b) | Input material (c) | Sequencing output | Read source (d) | Assembly tool (e) | Binning method | Binning parameters (f) |
|---|---|---|---|---|---|---|---|---|
| Training | HC227_Cc | 227 | gDNA evenly | 2 × 150 bp PE total: 60 Gb | ERS5705986 | SPAdes | CONCOCT | Comp |
| | HC227_Ccc | | | | | | CONCOCT | Comp + cov |
| | HC227_Xcc | | | | | | MaxBin | Comp + cov |
| | HC227_Mc | | | | | | MetaBAT2 | Comp |
| | HC227_Mcc | | | | | | MetaBAT2 | Comp + cov |
| Test | BMock12_Mc | 12 | gDNA unevenly | 2 × 150 bp PE total: 64 Gb | SRR8073716 | SPAdes | MetaBAT2 | Comp |
| | BMock12_Mcc | | | | | | MetaBAT2 | Comp + cov |
| | Rinke_Mc | 54 | gDNA evenly | 2 × 150 bp PE total: 13 Gb | Rinke et al. [ | SPAdes | MetaBAT2 | Comp |
| | Rinke_Mcc | | | | | | MetaBAT2 | Comp + cov |
| | MBARC-26_Mc | 26 | gDNA unevenly | 2 × 150 bp PE total: 51.9 Gb | SRR3656745 | SPAdes | MetaBAT2 | Comp |
| | MBARC-26_Mcc | | | | | | MetaBAT2 | Comp + cov |
| | ZymoCS_Mc | 10 | gDNA evenly | 2 × 150 bp PE total: 3 Gb | ERR2984773 | SPAdes | MetaBAT2 | Comp |
| | ZymoCS_Mcc | | | | | | MetaBAT2 | Comp + cov |
| | Quince_Mc | 210 | Simulated reads unevenly | 2 × 150 bp PE total: 180 Gb | Quince et al. [ | MEGAHIT | MetaBAT2 | Comp |
| | Quince_Mcc | | | | | | MetaBAT2 | Comp + cov |

Blank cells repeat the value of the row above within the same mock community.
(a) Letter code after the underscore refers to the binning method (upper case: C, CONCOCT; X, MaxBin; M, MetaBAT2) and parameters (lower case: c, composition; cc, composition and coverage)
(b) Number of strains in the mock
(c) gDNA: genomic DNA; (un)evenly specifies the distribution of the individual inputs
(d) SRR (Sequence Read Archive accession number), ERR (European Nucleotide Archive accession number)
(e) SPAdes version 3.14. For MEGAHIT, assemblies were provided with the publication
(f) Comp, composition; cov, coverage
Average accuracy of a quadratic discriminant model between the two most difficult to separate genomes within a set of five genomes
| Method | Size (kb) | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 10 | 20 | 30 | 40 | 50 | 75 | 100 | |
| PaSiT4 | 0.61 | 0.62 | 0.65 | 0.71 | 0.74 | 0.79 | 0.88 | 0.92 | |
| MMZ3 | 0.65 | 0.78 | 0.90 | 0.92 | 0.94 | 0.95 | 0.97 | 0.99 | |
| MMZ4 | 0.60 | 0.76 | 0.92 | 0.93 | 0.95 | 0.97 | 0.95 | 0.96 | |
| Freq4 | 0.71 | 0.90 | 0.94 | 0.95 | 0.95 | 0.95 | 0.96 | 0.97 | |
Selected combinations are in bold
Fig. 1 Graphical summary of the pre-processing steps used to evaluate the usability of a specified combination of fragment length and signature choice for a given set of five genomes. Genomes are split into fragments of a specified length and with a specified overlap. For each fragment, each signature calculated using the target method is viewed as an observation, and PCA is performed to reduce the observations to two dimensions. Finally, QDA is performed between the two closest clusters made up of observations from the same genome, and the accuracy of this classifier is reported
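The first two steps of the Fig. 1 workflow (fragmenting a genome with overlap and computing one signature per fragment) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `split_fragments` and `kmer_signature` are hypothetical, the signature is assumed to be a plain relative tetranucleotide frequency vector (one plausible reading of "Freq4"), and the toy sequence is invented. The subsequent PCA and QDA steps are not shown.

```python
from collections import Counter
from itertools import product


def split_fragments(sequence, length, overlap):
    """Split a sequence into fragments of `length` bases overlapping by `overlap` bases."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    step = length - overlap
    # Only full-length fragments are kept; a shorter tail is dropped.
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]


def kmer_signature(fragment, k=4):
    """Relative frequency of every k-mer over A, C, G, T, in a fixed order.

    Assumption: a simple frequency vector stands in for the signature method.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    # k-mers containing other symbols (for example N) are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Toy 200-base "genome", purely illustrative.
genome = "ACGT" * 50
fragments = split_fragments(genome, length=80, overlap=40)
# One observation (signature vector) per fragment.
signatures = [kmer_signature(f) for f in fragments]
```

Each entry of `signatures` is one observation in the sense of the caption; in the described workflow, observations from all five genomes would then be projected to two dimensions and classified.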
Fig. 2 Bacterial strains from the HC227 mock community cover a large range of genome sizes and % GC. Colours indicate the class of each member
Fig. 3 Comparison of the composition of the training (HC227) and test mocks (others). Species (red) and genera (grey) present in both HC227 and the test mocks are connected. Each distinct phylum is represented by a separate colour, as are distinct taxonomic classes and orders. Phyla that are present in HC227 are marked with an asterisk in the legend, and phyla belonging to Archaea are indicated by a dark-grey band. Additional information is also provided for each strain in each mock, including its number (outside), genome size (dark grey) and GC content (light grey)
Fig. 4 Closest analogues for completeness (a) and purity (b) obtained from the output of CheckM as a function of the actual values in the training dataset bins. Data points are coloured according to the binner used. The blue line is the best linear fit
Fig. 5 Fraction of the HC227 assembly that can potentially be covered by the analysis for specified fragment lengths
Fig. 6 Completeness and purity of bins generated using CONCOCT, MaxBin and MetaBAT2 based on composition (c) and on composition and coverage (cc)
Fig. 7 Performance of CheckM (a and b), MAGISTA (c and d) and MAGISTIC (e and f) on all test datasets for completeness (a, c and e) and purity (b, d and f). The black and blue lines indicate the ideal performance and best linear fit, respectively. Data points are coloured in accordance with their taxonomic relation to the most related genome in the training/reference set of the method whose performance is shown
Performance of all models on the test dataset (all) and subsets containing real and simulated reads
| Bin statistic | Model | Real | Simulated | All | RMSE (Real) | RMSE (Simulated) | RMSE (All) |
|---|---|---|---|---|---|---|---|
| Completeness | CheckM | 0.744 | 0.612 | 0.685 | 17.28 | 22.54 | 20.05 |
| | MAGISTA | 0.814 | 0.730 | 0.777 | 14.73 | 18.81 | 16.87 |
| | MAGISTIC | 0.905 | 0.836 | 0.873 | 10.52 | 14.68 | 12.75 |
| Purity | CheckM | 0.722 | -0.261 | 0.143 | 7.74 | 30.61 | 22.21 |
| | MAGISTA | 0.204 | 0.240 | 0.365 | 13.10 | 23.76 | 19.12 |
| | MAGISTIC | 0.672 | 0.234 | 0.449 | 8.41 | 23.85 | 17.80 |
| F1 | CheckM | 0.778 | 0.536 | 0.666 | 14.85 | 23.46 | 19.58 |
| | MAGISTA | 0.787 | 0.725 | 0.766 | 14.57 | 18.04 | 16.38 |
| | MAGISTIC | 0.884 | 0.775 | 0.834 | 10.75 | 16.32 | 13.79 |
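For readers who want to reproduce this kind of comparison on their own bins, the RMSE reported in the table can be computed as sketched below. This is an illustrative sketch under stated assumptions: the helper names `rmse` and `f1_score` are hypothetical, the example values are invented, and F1 is assumed to be the harmonic mean of completeness and purity, which the text here does not state explicitly.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual bin statistics."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def f1_score(completeness, purity):
    """Harmonic mean of completeness and purity (assumed definition of F1)."""
    if completeness + purity == 0:
        return 0.0
    return 2 * completeness * purity / (completeness + purity)


# Invented example values in percent, not taken from the study.
predicted = [90.0, 60.0, 75.0]
actual = [80.0, 70.0, 75.0]
error = rmse(predicted, actual)
```

Lower RMSE indicates that a method's predicted completeness or purity lies closer to the known value for the mock community.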