Mosè Manni, Matthew R Berkeley, Mathieu Seppey, Felipe A Simão, Evgeny M Zdobnov.
Abstract
Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly procedures and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here, we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying data sets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate BUSCO data set for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly introduced genome workflow improves efficiency and shortens runtimes, especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins to transcriptomes and gene sets.
Keywords: completeness; eukaryotes; genome; metagenomes; microbes; prokaryotes; quality assessment; transcriptome; viruses
Year: 2021 PMID: 34320186 PMCID: PMC8476166 DOI: 10.1093/molbev/msab199
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Number of odb9 and odb10 BUSCO Data Sets.
| Taxonomic Group | odb9 (v3) | odb10 (v4/5) |
|---|---|---|
| Bacteria | 16 | 83 |
| Archaea | 0 | 16 |
| Viruses | 0 | 27 |
| Eukaryota | 33 | 67 |
| Protist | 2 | 7 |
| Fungi | 10 | 24 |
| Plants | 1 | 9 |
| Metazoa | 14 | 26 |
| Arthropoda | 5 | 8 |
| Vertebrata | 7 | 15 |
| Total | 49 | 193 |
Note.—The odb10 version greatly expanded the number of benchmarking data sets.
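The Total row appears to count only the four top-level groups (Bacteria, Archaea, Viruses, Eukaryota), with the remaining rows read as subsets of Eukaryota. That reading is an assumption, since the table does not state it, but a quick check in Python using the counts from the table is consistent with it:

```python
# Counts copied from the table: (odb9, odb10) per top-level group.
# Assumption: the remaining rows (Protist through Vertebrata) are subsets of
# Eukaryota and are therefore not added to the total.
top_level = {
    "Bacteria": (16, 83),
    "Archaea": (0, 16),
    "Viruses": (0, 27),
    "Eukaryota": (33, 67),
}


def column_totals(rows):
    """Sum the odb9 and odb10 columns over the given rows."""
    odb9 = sum(counts[0] for counts in rows.values())
    odb10 = sum(counts[1] for counts in rows.values())
    return odb9, odb10


totals = column_totals(top_level)
print(totals)  # (49, 193), matching the Total row
```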
Fig. 1. Comparison of the number of complete BUSCOs obtained by running BUSCO v5 and v3 with the BUSCO odb10 and odb9 data sets, respectively, on (a) bacterial, (b) fungal, and (c) metazoan gene sets.
Fig. 2. (a) Comparison of BUSCO scores obtained on a set of fungal genomes using the two available workflows for eukaryotic species. The percentage on the y axis corresponds to the complete BUSCOs for the BUSCO_MetaEuk (orange) and BUSCO_Augustus (white) workflows. Assessments on gene sets are also displayed for comparison (green). Genomes were assessed using the most specific available data sets, which are displayed at the top of each subpanel. The newly introduced BUSCO_MetaEuk workflow allows faster assessments; see supplementary figure 3a, Supplementary Material online, for the differences in runtimes. (b and c) Effect of using different MetaEuk sensitivity values on BUSCO_MetaEuk runtimes and completeness estimation for 112 arthropod genomes evaluated with their most specific BUSCO data set. The default values are set at s = 4.5 and s = 6 for the first and the second MetaEuk runs, respectively. For the analyses, the same sensitivity value displayed on the y axis was used for both MetaEuk runs. The axis corresponding to runtimes (in seconds) is log-transformed.
Fig. 3. BUSCO assessment on microbial data and comparison with CheckM. (a) Accuracy in the choice of data set produced by the auto-lineage mode when analyzing bacterial and archaeal assemblies (n = 436). For a given assembly, there can be between one and four suitable data sets (from the more general, root data set, down to the more specific one) to choose from (x axis). The selected data set is considered as “correct” when it is the most lineage-specific available for the genome; “suboptimal” when a parent lineage is selected; and “in disagreement with the NCBI” when the selected lineage is not part of the NCBI taxonomic annotation of that genome. This might indicate an error; however, 12 out of 19 genomes in this category are annotated by NCBI as “unclassified,” while sharing a parent lineage with the BUSCO selected data set; e.g., assembly GCF_000153385.1 is an unclassified Flavobacteria and was assigned to the flavobacteriales_odb10 data set (also see supplementary table 7, Supplementary Material online). When supported by a high BUSCO score, this suggests that the data set selected by BUSCO was appropriate. (b and c) Comparison of BUSCO and CheckM completeness (blue) and redundancy (red) scores on a set of 436 genomes. For clarity, the two scatterplots are zoomed in on the areas of highest densities. n represents the number of data points displayed in the zoomed area. (d) Memory requirements for running BUSCO with the auto-lineage workflow on a set of bacterial and fungal genomes.
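The three categories used in panel (a) can be written as a small decision rule. The following is a minimal sketch of that rule as described in the caption, not the authors' evaluation code; the function name `classify_selection` and the example lineage are hypothetical and chosen only for illustration.

```python
def classify_selection(suitable, selected):
    """Label an auto-lineage choice using the categories of Fig. 3a.

    suitable: data sets valid for the genome according to its NCBI taxonomy,
              ordered from the root data set to the most specific one.
    selected: data set chosen by the auto-lineage mode.
    """
    if not suitable:
        raise ValueError("at least one suitable data set is required")
    if selected == suitable[-1]:
        # The most lineage-specific data set available was picked.
        return "correct"
    if selected in suitable:
        # A parent lineage was picked instead of the most specific one.
        return "suboptimal"
    # The selection is not part of the genome's NCBI lineage.
    return "in disagreement with the NCBI"


# Illustrative lineage only; the intermediate entry is an assumption.
lineage = ["bacteria_odb10", "bacteroidetes_odb10", "flavobacteriales_odb10"]

print(classify_selection(lineage, "flavobacteriales_odb10"))
print(classify_selection(lineage, "bacteria_odb10"))
print(classify_selection(lineage, "archaea_odb10"))
```

As the caption notes, a label of "in disagreement with the NCBI" does not by itself mean the selection was wrong, because the NCBI annotation may be "unclassified" at the relevant rank.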
Fig. 4. Benchmarking BUSCO estimates on artificially depleted genomes and gene sets of Drosophila melanogaster assessed with the diptera_odb10 data set. (a) Artificial depletion was made on the full gene set. (b) Artificial depletion was made exclusively on genes matching BUSCO markers. For both panels, solid red lines indicate the expected missing values. Five randomly depleted versions were used for each level of depletion. (c) Precision of the predictions for the analyses of panel (b).
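The depletion design of panel (b) can be illustrated with a toy simulation. This is a sketch under strong assumptions and not the authors' benchmarking procedure: it assumes each BUSCO marker is matched by exactly one gene, so removing a gene makes exactly one marker missing, and it uses made-up gene identifiers. The helper names `deplete_markers` and `missing_fraction` are hypothetical.

```python
import random


def deplete_markers(marker_genes, fraction, seed):
    """Return the genes retained after removing a random `fraction` of them."""
    genes = sorted(marker_genes)
    n_remove = round(fraction * len(genes))
    rng = random.Random(seed)  # seeded so each replicate is reproducible
    removed = set(rng.sample(genes, n_remove))
    return [gene for gene in genes if gene not in removed]


def missing_fraction(marker_genes, retained):
    """Fraction of markers with no retained gene (one gene per marker assumed)."""
    total = len(set(marker_genes))
    return (total - len(set(retained))) / total


# Hypothetical identifiers standing in for genes that match BUSCO markers.
markers = [f"gene_{i:04d}" for i in range(1000)]

for level in (0.1, 0.3, 0.5):
    # Five randomly depleted versions per depletion level, as in the caption.
    replicates = [deplete_markers(markers, level, seed) for seed in range(5)]
    observed = [missing_fraction(markers, retained) for retained in replicates]
    print(level, observed)
```

Under these idealized assumptions the missing fraction equals the depletion level in every replicate, which corresponds to the expected-value line in the figure; a real assessment can deviate from it, which is what the benchmark measures.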