Literature DB >> 34320186

BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes.

Mosè Manni^1,2, Matthew R Berkeley^1,2, Mathieu Seppey^1,2, Felipe A Simão^1,2, Evgeny M Zdobnov^1,2.

Abstract

Methods for evaluating the quality of genomic and metagenomic data are essential to aid genome assembly procedures and to correctly interpret the results of subsequent analyses. BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs. Here, we present new functionalities and major improvements of the BUSCO software, as well as the renewal and expansion of the underlying data sets in sync with the OrthoDB v10 release. Among the major novelties, BUSCO now enables phylogenetic placement of the input sequence to automatically select the most appropriate BUSCO data set for the assessment, allowing the analysis of metagenome-assembled genomes of unknown origin. A newly introduced genome workflow increases the efficiency and runtimes especially on large eukaryotic genomes. BUSCO is the only tool capable of assessing both eukaryotic and prokaryotic species, and can be applied to various data types, from genome assemblies and metagenomic bins, to transcriptomes and gene sets.

Entities: Chemical

Keywords: completeness; eukaryotes; genome; metagenomes; microbes; prokaryotes; quality assessment; transcriptome; viruses

Mesh：

Year: 2021 PMID： 34320186 PMCID： PMC8476166 DOI： 10.1093/molbev/msab199

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 16.240

Advances in sequencing technologies are powering accumulation of genomics data at an accelerating rate, from sequencing isolates and single cells to metagenomes of entire microbial communities. In turn, computational genomics approaches are essential to digest such molecular data into our evolving understanding of genome diversity across the tree of life, from microbes to animals and plants. Evaluating the quality of genomic data, in terms of completeness and redundancy, is critical for subsequent analyses and for the correct interpretation of the results. Complementing technical measures like the N50 value, biologically meaningful metrics based on expected gene content have proved to be useful for estimating the quality of genomes, as exemplified by our BUSCO tool (Simão et al. 2015; Waterhouse et al. 2018), the now discontinued CEGMA (Parra et al. 2007), CheckM (Parks et al. 2015) aimed at prokaryotes, EukCC (Saary et al. 2020) proposed for microbial eukaryotes, and CheckV (Nayfach et al. 2020) for viruses. The latest BUSCO versions introduce new functionalities for assessments of eukaryotic, prokaryotic, and viral data, along with improvements in runtimes and user experience. The underlying data sets have been renewed and expanded in sync with the OrthoDB v10 release (Kriventseva et al. 2019; Zdobnov et al. 2021), providing coverage of many more lineages and a revised baseline with increased data sampling.

New Approaches

Here, we describe the new functionalities and datasets introduced after the release of BUSCO v3. With respect to v3, the last BUSCO version, v5, features: 1) a major upgrade of the underlying data sets in sync with OrthoDB v10; 2) an updated workflow for the assessment of prokaryotic and viral genomes using the gene predictor Prodigal (Hyatt et al. 2010); 3) an alternative workflow for the assessment of eukaryotic genomes using the gene predictor MetaEuk (Levy Karin et al. 2020); 4) a workflow to automatically select the most appropriate BUSCO data set, enabling the analysis of sequences of unknown origin; 5) an option to run batch analysis of multiple inputs to facilitate high-throughput assessments of large data sets and metagenomic bins; and 6) a major refactoring of the code, and maintenance of two distribution channels on Bioconda (Grüning et al. 2018) and Docker (Merkel 2014). These developments make BUSCO suitable for comprehensive analyses of large heterogeneous data sets, from large eukaryotic genomes to metagenome-assembled genomes of microbial eukaryotes, prokaryotes, and viruses.

Results and Discussion

Upgrade of Data Sets

The benchmarking data sets of single-copy orthologs were revised and expanded using the v10 release of OrthoDB (www.orthodb.org), which provides evolutionary and functional annotations of orthologs among the most comprehensive sampling of genomic diversity. The creation of novel data sets was necessary to cover more lineages with higher resolution data sets and to revise the existing ones with increased and evenly sampled data across the phylogenetic tree. Following the strategy devised previously (Simão et al. 2015; Waterhouse et al. 2018) and taking advantage of the greatly increased number of representative species, we compiled 193 odb10 (OrthoDB v10) data sets (table 1), more than a 3-fold increase over odb9 sets, and comprising overall a 5-fold increase in the number of BUSCO marker genes (hereafter BUSCOs) derived from more than twice as many species as in the previous data sets version. Supplementary table 1 lists the available BUSCO odb10 data sets along with the number of markers and species used to construct the sets. BUSCO v5 also includes 27 viral data sets, supporting the analysis of a subset of viruses. We compared the estimates of completeness of BUSCO v5 with v3 for a self-validation of the major data sets on gene sets of Bacteria, Fungi, and Metazoa (fig. 1 and supplementary table 2, Supplementary Material online). Figure 1 shows a good concordance overall, especially for Metazoa. Slightly more conservative estimates of v5 over v3 for fungi can be explained by the higher number of markers in odb10 BUSCO data sets compared with odb9, and by a more than 6-fold increase in the number of species. Figure 1 shows a cluster of fungal genomes with BUSCO v5 scores diverging from v3. These species belong to Microsporidia, a group of early diverging fungi once thought to be protozoans (Wadi and Reinke 2020). Their genomes are known to have a reduced set of genes that are commonly present in fungi and other eukaryotes. The discrepancy between v3 and v5 reflects the increased number of markers for the fungi_odb10 data set which is more balanced toward the majority of fungal clades (supplementary fig. 1, Supplementary Material online). The “microsporidia_odb10” panel in figure 2 displays the assessment of microsporidian genomes with the most specific data set microsporidia_odb10, which yields a more accurate assessment of these genomes. This example highlights the importance of using the most specific BUSCO data set available for the species of interest, as large differences in terms of gene content can often occur within higher taxonomic levels. Nevertheless, to obviate the biased estimation when the fungi_odb10 is used on microsporidians we also added a “parasitic check” that recalculates the scores based on the list of fungal markers missing in these species (see supplementary notes, Supplementary Material online).

Table 1

Number of odb9 and odb10 BUSCO Data Sets.

Taxonomic Group	odb9 (v3)	odb10 (v4/5)
Bacteria	16	83
Archaea	0	16
Viruses	0	27
Eukaryota	33	67
Protist	2	7
Fungi	10	24
Plants	1	9
Metazoa	14	26
Arthropoda	5	8
Vertebrata	7	15
Total	49	193

Note.—The odb10 version greatly expanded the number of benchmarking data sets.

Fig. 1

Comparison of the number of complete BUSCOs obtained by running BUSCO v5 and v3 with BUSCO odb_10 and odb_9 data sets on (a) bacterial, (b) fungal, and (c) metazoan gene sets.

Fig. 2

(a) Comparisons of BUSCO scores obtained on a set of fungal genomes using the two available workflows for eukaryotic species. The percentage on the y axis corresponds to the complete BUSCOs for the BUSCO_MetaEuk (orange) and BUSCO_Augustus (white) workflows. Assessments on gene sets are also displayed for comparison (green). Genomes were assessed using the most specific available data sets, which are displayed at the top of each subpanel. The newly introduced BUSCO_MetaEuk workflow allows faster assessments, see supplementary figure 3a, Supplementary Material online, for the differences in runtimes. (b and c) Effect of using different MetaEuk sensitivity values on BUSCO_Metaeuk runtimes and completeness estimation for 112 arthropod genomes evaluated with their most specific BUSCO data set. The default values are set at s = 4.5 and s = 6 for the first and the second MetaEuk runs, respectively. For the analyses, the same sensitivity value displayed on the y axis was used for both MetaEuk runs. The axis corresponding to runtimes (in seconds) is log-transformed.

Comparison of the number of complete BUSCOs obtained by running BUSCO v5 and v3 with BUSCO odb_10 and odb_9 data sets on (a) bacterial, (b) fungal, and (c) metazoan gene sets. (a) Comparisons of BUSCO scores obtained on a set of fungal genomes using the two available workflows for eukaryotic species. The percentage on the y axis corresponds to the complete BUSCOs for the BUSCO_MetaEuk (orange) and BUSCO_Augustus (white) workflows. Assessments on gene sets are also displayed for comparison (green). Genomes were assessed using the most specific available data sets, which are displayed at the top of each subpanel. The newly introduced BUSCO_MetaEuk workflow allows faster assessments, see supplementary figure 3a, Supplementary Material online, for the differences in runtimes. (b and c) Effect of using different MetaEuk sensitivity values on BUSCO_Metaeuk runtimes and completeness estimation for 112 arthropod genomes evaluated with their most specific BUSCO data set. The default values are set at s = 4.5 and s = 6 for the first and the second MetaEuk runs, respectively. For the analyses, the same sensitivity value displayed on the y axis was used for both MetaEuk runs. The axis corresponding to runtimes (in seconds) is log-transformed. Number of odb9 and odb10 BUSCO Data Sets. Note.—The odb10 version greatly expanded the number of benchmarking data sets.

Novel BUSCO Workflows

The BUSCO software was revised and new functionalities introduced to enable the analysis of inputs of unknown taxonomic origin, and to improve assessments, throughput and runtimes. A breakdown of the features in v5 and the differences with v3 are described in the schema of supplementary figure 2, supplementary table 3, and supplementary notes, Supplementary Material online.

New Prokaryotic and Eukaryotic Workflows

The BUSCO software was revised to integrate the gene predictor Prodigal (Hyatt et al. 2010) for improving the analysis of prokaryotic genomes. The use of Prodigal coupled with the expanded number of data sets now available for Bacteria and Archaea (83 more data sets with respect to v3) make BUSCO a suitable choice when assessing prokaryotic data. A common issue when predicting genes is to select the correct genetic code (i.e. translation table) for the species under analysis. Each prokaryotic BUSCO data set now contains information on the potential genetic codes characteristic of the species within the corresponding lineage. BUSCO selects the most likely genetic code automatically based on which code yields the highest coding density. BUSCO v5 features a new workflow for the analysis of eukaryotic genomes that employs the gene predictor MetaEuk (Levy Karin et al. 2020), which relies on MMseqs2 (Steinegger and Söding 2017). MetaEuk was introduced to improve the assessment of large genomes for which the previous workflow was suffering from long runtimes. Two consecutive MetaEuk runs are implemented, and parameters are tuned differently for the second run to search for BUSCO genes missing after the first run. The BUSCO_MetaEuk workflow is the default option in v5, as it allows faster assessments. The Augustus gene predictor (Stanke et al. 2008) is still available in BUSCO and can be selected by specifying “–augustus” when running the analysis in “genome” mode. Since the two workflows use gene predictors that are based on different methods, it is expected to obtain nonidentical results when assessing the same genome. Nevertheless, they produce comparable completeness estimations on genomic sequences, which at times outperform the completeness of corresponding gene sets (fig. 2 and supplementary fig. 3, supplementary table 4, and supplementary notes, Supplementary Material online). Figure 2 compares the results of BUSCO v5 using the two gene predictors on fungal genomes and their corresponding gene sets. Comparisons on arthropod and protist data are reported in supplementary figure 3b and c, Supplementary Material online. On a set of 139 arthropod genomes, we investigated if there is a set of BUSCOs consistently missed by one workflow and found by the other one. There were no consistent major differences in the ability to predict specific BUSCOs on this set, apart from a couple of exceptions (supplementary notes and supplementary fig. 4, Supplementary Material online). Supplementary figure 3, Supplementary Material online, shows the substantial improvement in runtimes (runtimes axis has a log10 scale) when using the BUSCO_MetaEuk workflow. The higher speed is especially useful when assessing large genomes. For example, BUSCO_MetaEuk runs in 8 h and 50 min (using 56 CPUs) on the 10.7-Gb genome of the wheat Triticum dicoccoides (accession: GCF_002162155.1) with the poales_odb10 data set (BUSCO score: C : 99.1% [S : 9.6%, D : 89.5%], F : 0.1%, M : 0.8%, n : 4896), whereas the BUSCO_ Augustus workflow takes several days to complete. Even faster analyses can be obtained by reducing the sensitivity value (-s) of the two runs, via the “–metaeuk_parameters” and “–metaeuk_rerun_parameters” options (default values are s = 4.5 and s = 6 for the first and second run, respectively) (fig. 2). For example, the runtime on the T. dicoccoides genome decreases to 2 h and 24 min using a sensitivity value of s = 3 (BUSCO score: C : 98.7% [S : 9.6%, D : 89.1%], F : 0.2%, M : 1.1%, n : 4896). However, changing the sensitivity values can have an impact on the estimates (fig. 2 and supplementary table 5, Supplementary Material online). The default values were chosen as a trade-off between accuracy and runtimes. In most settings it is not advisable to change the sensitivity values in order to keep BUSCO results comparable. Nevertheless, having this option can be convenient when assessing very large genomes or for getting faster evaluations on preliminary assemblies. Assembling genomes is an iterative process in which multiple draft assemblies are often produced to compare the outcome of different parameters/pipelines. Using smaller sensitivity values facilitates quick draft BUSCO assessments that were not feasible before, and should speed up the overall genome assembly procedure. The results now report the workflow used for the analysis, and this should be specified along with the BUSCO data set when reporting scores in publications.

Auto-Lineage Workflow: An All-in-One Option for Quality Assessment of (Meta)Genomic Data

BUSCO v5 is the only available tool that can assess genomic data from the three domains of life in a single analysis by using the new “–auto-lineage” function. This is achieved through the phylogenetic placement of the input sequence (genome, gene set, or transcriptome) on a set of precomputed phylogenetic trees using SEPP (Mirarab et al. 2011) and pplacer (Matsen et al. 2010). Subsequently, BUSCO automatically attempts to select the most specific (i.e. highest resolution) data set available for the species of interest. This solves a major problem when analyzing metagenomic data where the taxonomic origin of metagenome-assembled genomes is often unknown and both eukaryotic and prokaryotic genomes can be present in the sample. For ease of batch analyses, it is now possible to run BUSCO on multiple input sequences, and an additional table summarizing the scores for all inputs is returned. Additionally, BUSCO can automatically detect a subset of viruses belonging to clades of the 27 newly introduced viral data sets. Supplementary figure 5 and supplementary table 6, Supplementary Material online, show the assessment on RefSeq (Brister et al. 2015) viral genomes and gene sets using the “–auto-lineage” function. We plan to expand the virus pipeline in future to allow assessments of a broader set of viruses. An overview of the auto-lineage workflow can be found in the supplementary notes, Supplementary Material online. Figure 3 presents the performance, in terms of selecting the right data set, of the auto-lineage procedure for 436 bacterial/archaeal genomes (supplementary table 7, Supplementary Material online). We compared BUSCO v5 with CheckM v1.1.3 (Parks et al. 2015), which, while conceptually similar, is only capable of performing assessments on bacterial and archaeal data. Figure 3 and c reports BUSCO and CheckM completeness and redundancy scores for the same set of 436 genomes (see also supplementary table 8, Supplementary Material online). BUSCO estimates appear to be more conservative for some data points. This is in part related to the higher resolution data sets (i.e. more markers) automatically selected for the assessment (supplementary fig. S6, Supplementary Material online). Overall, on prokaryotic data, BUSCO has comparable results to CheckM. However, a major advantage of BUSCO is the ability to detect and assess eukaryotic microbial genomes.

Fig. 3

BUSCO assessment on microbial data and comparison with CheckM. (a) Accuracy in the choice of data set produced by the auto-lineage mode when analyzing bacterial and archaeal assemblies (n = 436). For a given assembly, there can be between one and four suitable data sets (from the more general, root data set, down to the more specific one) to choose from (x axis). The selected data set is considered as “correct” when it is the most lineage-specific available for the genome; “suboptimal” when a parent lineage is selected; and “in disagreement with the NCBI” when the selected lineage is not part of the NCBI taxonomic annotation of that genome. This might indicate an error; however, 12 out of 19 genomes in this category are annotated by NCBI as “unclassified,” while sharing a parent lineage with the BUSCO selected data set; e.g. assembly GCF_000153385.1 is an unclassified Flavobacteria and was assigned to flavobacteriales_odb10 data set (also see supplementary table 7, Supplementary Material online). When supported by a high BUSCO score, this suggests that the data set selected by BUSCO was appropriate. (b and c) Comparison of BUSCO and CheckM completeness (blue) and redundancy (red) scores on a set of 436 genomes. For clarity, the two scatterplots are zoomed in on the areas of highest densities. n represents the number of data points displayed in the zoomed area. (d) Memory requirements for running BUSCO with the auto-lineage workflow on a set of bacterial and fungal genomes. In terms of resources, the mean runtime per genome with BUSCO (on the 436 prokaryotic genomes, in batch mode and using the prokaryotic-specific “–auto-lineage-prok”) is less than a minute (e.g. 35 and 48 s with 30 and eight CPUs, respectively) (supplementary tables 9 and 10, Supplementary Material online). Running BUSCO through a workflow management system can considerably reduce the overall runtime. We provide an example using Snakemake (Mölder et al. 2021) at https://gitlab.com/ezlab/plugins_buscov5. With this setup and allowing a total number of 30 CPUs with five CPUs per task, the overall runtime for completing the same assessment was reduced to 95 min (a mean runtime of 13 s per genome) compared with 257 min. Figure 3 shows the memory requirements for assessing a set of bacterial and fungal genomes with the BUSCO auto-lineage workflow. The memory requirements do not exceed 11GB for bacterial genomes, so that they can be assessed on laptops with limited memory and CPU resources (e.g. the same assessment ran to completion on a MacBook Pro with 16GB and eight cores). This is an advantage in comparison to the 70 GB of memory required by CheckM (supplementary table 9, Supplementary Material online). The percentage of duplicated markers reported by BUSCO can reflect technical or biological duplications (i.e. redundant markers derive from the same genome), or contaminations from other species/strains. BUSCO cannot directly distinguish whether redundant BUSCOs are due to duplications or contamination. Nevertheless the sequences detected by BUSCO as “duplicated” can be used as evidence for further investigation. In general, a high “duplication” score for prokaryotic data is more likely to be caused by contamination rather than gene duplication, especially for metagenome-assembled genomes, which often require manual refinements (Eren et al. 2015). Figure 3 shows an overall concordance between BUSCO “duplication” and CheckM “contamination” estimates. Additionally, BUSCO v5 can highlight contamination deriving from species belonging to other domains by means of the scores obtained from assessing the input with the three “root” data sets (bacteria_odb10, archaea_odb10, eukaryota_odb10). These assessments are automatically performed as the first step of the “auto-lineage” workflow. As there can be a background level of cross-matches between data sets, it is expected to have BUSCOs scoring in multiple “root” data sets. For example, running the “root” data sets on a set of 2,779 bacterial genomes from RefSeq (O’Leary et al. 2016) results in a median complete BUSCO score of 18.5% and 4.3% for the archaea_odb10 and eukaryota_odb10 data sets, respectively (supplementary table 10, Supplementary Material online). The frequency of matches (including those reported as “fragmented”) for each BUSCO is shown in supplementary figure 7, Supplementary Material online. 214 (83.9%) of the eukaryota_odb10 BUSCOs are detected in 5% or fewer of the bacterial genomes, with 184 (72.2%) with no matches at all. 120 (61.9%) of the archaea_odb10 BUSCOs are detected in 5% or fewer of the bacterial genomes, with 87 (44.84%) never showing up (supplementary table 11, Supplementary Material online). The percentage of complete matches from bacteria_odb10 and archaea_odb10 in eukaryotic genomes is, on average, higher, e.g. 12.1% and 21.1% in fungal genomes and of 30.7% and 62.4% in arthropod gene sets, respectively (supplementary fig. 8 and supplementary tables 10 and 11, Supplementary Material online), making it less useful to spot contaminants. Nevertheless, high completeness scores from these data sets may alert users to the potential presence of contaminant species or horizontal gene transfer events from other domains, whereas high duplication scores may indicate the presence of multiple contaminant species.

Benchmark of Predictions

To assess the precision of the BUSCO estimates, we benchmarked BUSCO v5 predictions on gene sets and genomes artificially depleted of randomly selected genes. Briefly, we randomly removed 0/10/30/50% of the genes in the official gene set, generating five simulated versions for each level of depletion. The corresponding genes in the genome were masked using the coordinates from the GFF file. BUSCO v5 was run on the simulated gene sets and genomes (both BUSCO_Augustus and BUSCO_MetaEuk workflows) using the most specific data set and the most generic one (for more details, see supplementary notes, Supplementary Material online). Figure 4 shows an example of the results of the benchmarks for Drosophila melanogaster (assembly accession: GCF_000001215.4). The predicted BUSCO estimates on depleted gene sets and genomes have a good overall correspondence to the expected values when using the diptera_odb10 data set, the most specific and appropriate data set. When using the more generic eukaryota_odb10 data set, the estimates are subjected to more variability (supplementary fig. 9a, Supplementary Material online) across the different depleted versions, which is expected and explained by the lower number of markers, and the correspondingly lower coverage of the genome/gene set. This further highlights the importance of using the most specific data set when possible.

Fig. 4

Benchmarking BUSCO estimates on artificially depleted genomes and gene sets of Drosophila melanogaster assessed with the diptera_odb10 data set. (a) Artificial depletion was made on the full gene set. (b) Artificial depletion exclusively made on genes matching BUSCO markers. For both panels, solid red lines indicate the expected missing values. Five randomly depleted versions were used for each level of depletion. (c) Precisions of the predictions for the analyses of panel (b). A second slightly different type of benchmarking was used to compute the number of false positive (FP)/false negative (FN) predictions. In this case we exclusively depleted genes predicted to be BUSCO markers. We first mapped the gene set to the OrthoDB level of interest (e.g. if assessing the species with diptera_odb10, the gene set was mapped to the OrthoDB diptera level). Based on this ortholog mapping we depleted the gene set/assemblies by 0/10/30/50/100% of the predicted BUSCO genes, generating five versions for each depletion level (except for 100% depletion). BUSCO v5 was run on these simulated data and the FP/FN and precision estimates were computed using the initial OrthoDB mapping as ground truth. Figure 4 shows the overall congruent estimates on D. melanogaster data. Removing 100% of the potential markers results in a small percentage of false positives, with the newly introduced BUSCO_MetaEuk workflow having a smaller number of false positives. All three modes have similar precision (fig. 4), with BUSCO_MetaEuk workflow showing a slightly higher precision when depletion is equal to or above 50%. Supplementary figure 10, Supplementary Material online, shows the results of the two benchmarks on the yeast Saccharomyces cerevisiae (GCF_000146045.2). A detailed description of the benchmark procedures is reported in the supplementary notes, Supplementary Material online.

Other Improvements and Distribution

Along with a major refactoring of the code, new options have been added for managing the increasing number of data sets and to facilitate the analysis procedure. In particular, we added a default option to automatically download the necessary precomputed files for phylogenetic placement, and the data sets, either by specifying “-l ” as an option on the command line when initiating a BUSCO run, or by running in auto-lineage mode. Documentation and software setup instructions are all described in detail at https://busco.ezlab.org/busco_userguide.html. We now also maintain a BUSCO package on Bioconda (https://anaconda.org/bioconda/busco) and a Docker container. We encourage BUSCO users to favor these two approaches to control the version of each software dependency that is used. In addition, the BUSCO code is still distributed on GitLab https://gitlab.com/ezlab/busco.

Materials and Methods

BUSCO data sets are available at https://busco-data.ezlab.org/v5/data/lineages/. Each BUSCO data set contains the details on the species, orthologous groups, and genes used to construct the set. Versions and accessions of all the genome assemblies and gene sets, and the BUSCO main results analyzed as part of this study are listed in the supplementary tables, Supplementary Material online. Further details on the analyses are described in the supplementary notes, Supplementary Material online. Plots presenting the results of the analyses were made using the ggplot2 package (Wickham 2009) in R (R Core Team 2020).

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online. Click here for additional data file.

20 in total

1. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes.

Authors: Genis Parra; Keith Bradnam; Ian Korf
Journal: Bioinformatics Date: 2007-03-01 Impact factor: 6.937

2. Using native and syntenically mapped cDNA alignments to improve de novo gene finding.

Authors: Mario Stanke; Mark Diekhans; Robert Baertsch; David Haussler
Journal: Bioinformatics Date: 2008-01-24 Impact factor: 6.937

3. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

4. Bioconda: sustainable and comprehensive software distribution for the life sciences.

Authors: Björn Grüning; Ryan Dale; Andreas Sjödin; Brad A Chapman; Jillian Rowe; Christopher H Tomkins-Tinch; Renan Valieris; Johannes Köster
Journal: Nat Methods Date: 2018-07 Impact factor: 28.547

5. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.

Authors: Martin Steinegger; Johannes Söding
Journal: Nat Biotechnol Date: 2017-10-16 Impact factor: 54.908

6. SEPP: SATé-enabled phylogenetic placement.

Authors: S Mirarab; N Nguyen; T Warnow
Journal: Pac Symp Biocomput Date: 2012

7. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes.

Authors: Donovan H Parks; Michael Imelfort; Connor T Skennerton; Philip Hugenholtz; Gene W Tyson
Journal: Genome Res Date: 2015-05-14 Impact factor: 9.043

8. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs.

Authors: Evgenia V Kriventseva; Dmitry Kuznetsov; Fredrik Tegenfeldt; Mosè Manni; Renata Dias; Felipe A Simão; Evgeny M Zdobnov
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

9. Evolution of microsporidia: An extremely successful group of eukaryotic intracellular parasites.

Authors: Lina Wadi; Aaron W Reinke
Journal: PLoS Pathog Date: 2020-02-13 Impact factor: 6.823

10. OrthoDB in 2020: evolutionary and functional annotations of orthologs.

Authors: Evgeny M Zdobnov; Dmitry Kuznetsov; Fredrik Tegenfeldt; Mosè Manni; Matthew Berkeley; Evgenia V Kriventseva
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

174 in total

1. Stress response gene family expansions correlate with invasive potential in teleost fish.

Authors: Taylor R Stanley; Karen S Kim Guisbert; Sabrina M Perez; Morgan Oneka; Isabela Kernin; Nicole R Higgins; Alexandra Lobo; Munevver M Subasi; David J Carroll; Ralph G Turingan; Eric Guisbert
Journal: J Exp Biol Date: 2022-03-08 Impact factor: 3.312

2. HGT is widespread in insects and contributes to male courtship in lepidopterans.

Authors: Yang Li; Zhiguo Liu; Chao Liu; Zheyi Shi; Lan Pang; Chuzhen Chen; Yun Chen; Ronghui Pan; Wenwu Zhou; Xue-Xin Chen; Antonis Rokas; Jianhua Huang; Xing-Xing Shen
Journal: Cell Date: 2022-07-18 Impact factor: 66.850

3. Comparative Genomics Reveal the High Conservation and Scarce Distribution of Nitrogen Fixation nif Genes in the Plant-Associated Genus Herbaspirillum.

Authors: Ana Marina Pedrolo; Filipe Pereira Matteoli; Cláudio Roberto Fônseca Sousa Soares; Ana Carolina Maisonnave Arisi
Journal: Microb Ecol Date: 2022-08-06 Impact factor: 4.192

4. Genomic insights into the phylogeny and biomass-degrading enzymes of rumen ciliates.

Authors: Zongjun Li; Xiangnan Wang; Yu Zhang; Zhongtang Yu; Tingting Zhang; Xuelei Dai; Xiangyu Pan; Ruoxi Jing; Yueyang Yan; Yangfan Liu; Shan Gao; Fei Li; Youqin Huang; Jian Tian; Junhu Yao; XvPeng Xing; Tao Shi; Jifeng Ning; Bin Yao; Huoqing Huang; Yu Jiang
Journal: ISME J Date: 2022-08-19 Impact factor: 11.217

Review 5. mRatBN7.2: familiar and unfamiliar features of a new rat genome reference assembly.

Authors: Tristan V de Jong; Hao Chen; Wesley A Brashear; Kelli J Kochan; Andrew E Hillhouse; Yaming Zhu; Isha S Dhande; Elizabeth A Hudson; Mary H Sumlut; Melissa L Smith; Theodore S Kalbfleisch; Peter A Doris
Journal: Physiol Genomics Date: 2022-05-11 Impact factor: 4.297

6. Complete Genome of Sphingomonas aerolata PDD-32b-11, Isolated from Cloud Water at the Summit of Puy de Dôme, France.

Authors: Domitille Jarrige; Thierry Nadalig; Muriel Joly; Martine Sancelme; Stéphane Vuilleumier; Pierre Amato; Françoise Bringel
Journal: Microbiol Resour Announc Date: 2022-09-15

7. Obligate sexual reproduction of a homothallic fungus closely related to the Cryptococcus pathogenic species complex.

Authors: Andrew Ryan Passer; Shelly Applen Clancey; Terrance Shea; Márcia David-Palma; Anna Floyd Averette; Teun Boekhout; Betina M Porcel; Minou Nowrousian; Christina A Cuomo; Sheng Sun; Joseph Heitman; Marco A Coelho
Journal: Elife Date: 2022-06-17 Impact factor: 8.713