Bruce Parrello, Rory Butler, Philippe Chlenski, Robert Olson, Jamie Overbeek, Gordon D Pusch, Veronika Vonstein, Ross Overbeek.
Abstract
BACKGROUND: Recent advances in high-volume sequencing technology and mining of genomes from metagenomic samples call for rapid and reliable genome quality evaluation. The current release of the PATRIC database contains over 220,000 genomes, and current metagenomic technology supports assemblies of many draft-quality genomes from a single sample, most of which will be novel.
DESCRIPTION: We have added two quality assessment tools to the PATRIC annotation pipeline. EvalCon uses supervised machine learning to calculate an annotation consistency score. EvalG implements a variant of the CheckM algorithm to estimate contamination and completeness of an annotated genome. We report on the performance of these tools and the potential utility of the consistency score. Additionally, we provide contamination, completeness, and consistency measures for all genomes in PATRIC and in a recent set of metagenomic assemblies.
Keywords: CheckM; Genome annotation; Genome quality; Machine learning; Metagenomics; RAST; Random forest; Supervised learning
Year: 2019 PMID: 31581946 PMCID: PMC6775668 DOI: 10.1186/s12859-019-3068-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Role correlations. Heatmap of role-role correlations for a subset of roles clustered according to the dendrogram clustering method in R. Roles are arranged according to their positions in a dendrogram (not shown) computed from their mutual correlations. In particular, roles that are clustered together in the dendrogram appear close to one another in the diagram; borders with high contrast correspond to divisions between higher-order clusters. This algorithm maximizes contrast in the heatmap at such boundaries and results in light-colored blocks of strongly correlated roles. High correlations along the diagonal correspond to highly conserved small sets of roles, e.g. subunits of a single protein complex, and all roles are fully correlated with themselves (ρ=1). While visual inspection of the blocks in the heatmap makes it apparent that these role-role correlations have an underlying structure, the actual nature of this structure is not obvious and is difficult to characterize precisely. EvalCon uses machine learning to learn these structures from role-role correlations, thereby eliminating the need for an a priori characterization.
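As a rough illustration of the role-role correlations behind such a heatmap, the sketch below computes correlations between role count vectors across a handful of genomes. It assumes Pearson correlation, which the caption does not specify. The role names, the counts, and the helpers `pearson` and `role_correlations` are hypothetical, and the dendrogram ordering used for the figure is not reproduced.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A role with constant counts carries no correlation signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def role_correlations(counts):
    """Map every ordered pair of roles to the correlation of their counts."""
    roles = sorted(counts)
    return {(r, s): pearson(counts[r], counts[s]) for r in roles for s in roles}


# Hypothetical data: each list holds one role's count in five genomes.
counts = {
    "RoleA": [1, 1, 0, 2, 0],
    "RoleB": [1, 1, 0, 2, 0],  # always co-occurs with RoleA
    "RoleC": [0, 0, 1, 0, 1],  # present only where RoleA is absent
}

corr = role_correlations(counts)
for (r, s), rho in sorted(corr.items()):
    print(f"{r} vs {s}: {rho:+.2f}")
```

In this toy data, `RoleA` and `RoleB` behave like subunits of one complex and would fall into the same light-colored block, while `RoleC` is negatively correlated with both.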
Fig. 2 Map of the process of training EvalCon given a machine learning algorithm and a set of training roles. For the development of EvalCon in PATRIC, the training roles were kept constant, and a variety of machine learning predictors were tested with this process.
Summary of machine learning algorithm performance
| Algorithm | Parameters | Training Time (s/role) | Roles >93% Accuracy | Avg. Accuracy (%) |
|---|---|---|---|---|
| Linear Discriminant Analysis | Default | 3.51 | 785 | 87.4 |
| Logistic Regression | Optimized | 9.16 | 1081 | 91.4 |
| Random Forest Regressor | Optimized | 1.35 | 1299 | 92.7 |
| ExtraTrees Classifier | Optimized | 0.89 | 1405 | 93.4 |
| XGBoost | Optimized | 7.40 | 1417 | 93.6 |
| Random Forest Classifier | Optimized | 1.01 | 1423 | 93.5 |
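
The per-role columns in the table above (training time per role, roles above 93% accuracy) suggest one predictor per role. The sketch below illustrates that layout under the assumption that each role's count is predicted from the counts of all other roles. A 1-nearest-neighbor rule stands in for the algorithms in the table, purely to keep the example dependency-free; the genomes, role names, and function names are hypothetical.

```python
def build_dataset(genomes, target):
    """Features: counts of every other role. Label: count of the target role."""
    others = sorted(r for r in genomes[0] if r != target)
    features = [[g[r] for r in others] for g in genomes]
    labels = [g[target] for g in genomes]
    return features, labels


def predict_1nn(train_x, train_y, x):
    """Return the label of the training row closest to x (squared Euclidean)."""
    nearest = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[nearest]


def leave_one_out_accuracy(genomes, target):
    """Fraction of genomes whose target-role count is predicted correctly
    when that genome is held out of the training set."""
    features, labels = build_dataset(genomes, target)
    hits = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += predict_1nn(train_x, train_y, features[i]) == labels[i]
    return hits / len(features)


# Hypothetical role counts for six genomes.
genomes = [
    {"RoleA": 1, "RoleB": 1, "RoleC": 0},
    {"RoleA": 1, "RoleB": 1, "RoleC": 0},
    {"RoleA": 0, "RoleB": 0, "RoleC": 1},
    {"RoleA": 0, "RoleB": 0, "RoleC": 1},
    {"RoleA": 2, "RoleB": 2, "RoleC": 0},
    {"RoleA": 2, "RoleB": 2, "RoleC": 0},
]

# One predictor per role, evaluated independently.
accuracy_by_role = {
    role: leave_one_out_accuracy(genomes, role) for role in sorted(genomes[0])
}
print(accuracy_by_role)
```

Swapping `predict_1nn` for any other classifier leaves the per-role loop unchanged, which is the sense in which different algorithms can be compared on the same set of training roles.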
Fig. 3 Sample problematic roles report. First six rows of a problematic roles report for a draft genome produced by the PATRIC metagenome binning service. The first four rows represent coarse inconsistencies: one role that is predicted but not observed, and three roles that are observed but not predicted. The fifth row represents a fine inconsistency corresponding to an extra PEG, and the sixth represents a fine inconsistency corresponding to a missing PEG. Where applicable, the comment field notes universal roles, contig membership for observed roles, short contigs, contigs with no good roles, features appearing near the ends of contigs, and closest features on the reference genome.
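The distinction between coarse and fine inconsistencies in this caption can be sketched as follows: a coarse inconsistency is a disagreement about whether a role is present at all, and a fine inconsistency is a disagreement about how many PEGs carry it. The percentage scores computed at the end are an assumed definition, not one stated here, and the role names and counts are hypothetical.

```python
def classify(predicted, observed):
    """Label one role by comparing its predicted and observed counts."""
    if predicted == observed:
        return "consistent"
    if (predicted > 0) != (observed > 0):
        return "coarse"  # presence/absence disagrees
    return "fine"        # present in both, but the counts differ


def problematic_roles(predicted, observed):
    """List (role, predicted, observed, kind) for every inconsistent role."""
    report = []
    for role in sorted(predicted):
        p, o = predicted[role], observed.get(role, 0)
        kind = classify(p, o)
        if kind != "consistent":
            report.append((role, p, o, kind))
    return report


def consistency_scores(predicted, observed):
    """Assumed scoring: percentage of roles without a coarse inconsistency,
    and percentage of roles whose counts match exactly."""
    kinds = [classify(predicted[r], observed.get(r, 0)) for r in predicted]
    coarse = 100 * sum(k != "coarse" for k in kinds) / len(kinds)
    fine = 100 * sum(k == "consistent" for k in kinds) / len(kinds)
    return coarse, fine


# Hypothetical predicted and observed role counts for one genome.
predicted = {"RoleA": 1, "RoleB": 0, "RoleC": 1, "RoleD": 2}
observed = {"RoleA": 0, "RoleB": 1, "RoleC": 2, "RoleD": 2}

report = problematic_roles(predicted, observed)
scores = consistency_scores(predicted, observed)
for row in report:
    print(row)
print(scores)
```

Here `RoleA` is predicted but not observed, `RoleB` is observed but not predicted, and `RoleC` has an extra PEG, mirroring the three kinds of rows described in the caption.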
Fig. 4 Fine consistency as a function of quality. Average fine consistency scores for 193 validation genomes under conditions of simulated incompleteness and contamination.
Fig. 5 Changes in predictor as a function of quality. Average percentage of predictions remaining constant for 193 validation genomes under conditions of simulated incompleteness and contamination.
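One simple way to picture simulated incompleteness and contamination is to drop a random fraction of a genome's annotated features, or to add features drawn from a second genome. The sketch below is only an assumption about what such a simulation could look like; the captions do not describe the actual procedure, and the function names, fractions, and feature lists are hypothetical.

```python
import random


def simulate_incompleteness(features, fraction, rng):
    """Keep a random subset, dropping roughly `fraction` of the features."""
    keep = len(features) - round(fraction * len(features))
    return rng.sample(features, keep)


def simulate_contamination(features, donor, fraction, rng):
    """Append donor features amounting to roughly `fraction` of the genome."""
    extra = min(round(fraction * len(features)), len(donor))
    return features + rng.sample(donor, extra)


# Hypothetical genomes, written as one role name per annotated PEG.
genome = [f"Role{i}" for i in range(10)]
donor = [f"Foreign{i}" for i in range(5)]

rng = random.Random(0)  # fixed seed so the run is repeatable
incomplete = simulate_incompleteness(genome, 0.2, rng)
contaminated = simulate_contamination(genome, donor, 0.3, rng)

print(len(incomplete), len(contaminated))
```

Scoring each degraded genome and averaging over a validation set would then give curves of the kind the two captions describe.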