Anupam Gautam, Hendrik Felderhoff, Caner Bağci, Daniel H Huson.
Abstract
In microbiome analysis, one main approach is to align metagenomic sequencing reads against a protein reference database, such as NCBI-nr, and then to perform taxonomic and functional binning based on the alignments. This approach is embodied, for example, in the standard DIAMOND+MEGAN analysis pipeline, which first aligns reads against NCBI-nr using DIAMOND and then performs taxonomic and functional binning using MEGAN. Here, we propose the use of the AnnoTree protein database, rather than NCBI-nr, in such alignment-based analyses to determine the prokaryotic content of metagenomic samples. We demonstrate a 2-fold speedup over using the prokaryotic part of NCBI-nr and increased assignment rates, in particular assigning twice as many reads to KEGG. In addition to binning to the NCBI taxonomy, MEGAN now also bins to the GTDB taxonomy.
IMPORTANCE The NCBI-nr database is not explicitly designed for the purpose of microbiome analysis, and its increasing size makes it unwieldy and computationally expensive for this purpose. The AnnoTree protein database is only one-quarter the size of the full NCBI-nr database and is explicitly designed for metagenomic analysis, so it should be supported by alignment-based pipelines.
Keywords: AnnoTree; NCBI-nr; alignment; function; functional analysis; microbiome analysis; protein sequences; software; taxonomy
Year: 2022 PMID: 35191776 PMCID: PMC8862659 DOI: 10.1128/msystems.01408-21
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 6.496
FIG 1 For prokaryotic species detected during analysis, we report the approximate relative abundance of organisms in the mock community (green) and the number of bases aligned and assigned by DIAMOND+MEGAN in an NCBI-nr run (blue) and in an AnnoTree run (orange).
Accession numbers and total number of reads for each data set
| Data set | Accession no. | Total no. of reads | Aligned reads, AnnoTree (no.) | Aligned reads, AnnoTree (%) | Aligned reads, NCBI-nr (no.) | Aligned reads, NCBI-nr (%) | Ratio |
|---|---|---|---|---|---|---|---|
| River1 | | 646,178 | 410,118 | 63.5 | 406,913 | 63.0 | 1.0 |
| River2 | | 129,753,222 | 90,535,941 | 69.8 | 88,403,713 | 68.1 | 1.0 |
| Seagrass | | 98,260,754 | 36,053,215 | 36.7 | 33,717,202 | 34.3 | 1.1 |
| Skin | | 22,827,626 | 13,403,495 | 58.7 | 14,122,490 | 61.9 | 0.9 |
| Stool | | 33,214,614 | 29,132,562 | 87.7 | 30,101,313 | 90.6 | 1.0 |
| Soil | | 97,595,185 | 10,992,188 | 11.3 | 7,264,223 | 7.4 | 1.5 |
| Thermal pools | | 52,908,626 | 15,751,382 | 29.8 | 16,625,446 | 31.4 | 0.9 |
| Bioreactor1 | | 99,998,110 | 73,151,916 | 73.1 | 72,806,515 | 72.8 | 1.0 |
| Bioreactor2 | | 44,258,996 | 36,608,649 | 82.7 | 37,477,641 | 84.7 | 1.0 |
| Bioreactor3 | | 694,827 | 613,958 | 88.4 | 616,536 | 88.7 | 1.0 |
| Total | | 580,158,138 | 306,653,424 | 52.86 | 301,541,992 | 51.98 | 1.02 |
For both the AnnoTree and NCBI-nr protein databases, we report the number and percentage of reads that obtained an alignment using DIAMOND. We also report the ratio of the number of reads aligned against AnnoTree to the number aligned against NCBI-nr.
Long-read data set.
Assigned reads
| Classification | AnnoTree run: assigned | AnnoTree run: % of R | AnnoTree run: % of Al | NCBI-nr run: assigned | NCBI-nr run: % of R | NCBI-nr run: % of Al | Ratio |
|---|---|---|---|---|---|---|---|
| NCBI taxonomy | 305,150,157 | 52.6 | 99.5 | 297,539,333 | 51.3 | 98.7 | 1.0 |
| GTDB taxonomy | 303,770,449 | 52.4 | 99.1 | 282,269,816 | 48.6 | 93.6 | 1.1 |
| EC | 78,874,545 | 13.6 | 25.7 | 76,552,285 | 13.2 | 25.4 | 1.0 |
| eggNOG | 95,932,149 | 16.5 | 31.3 | 87,131,284 | 15.0 | 28.9 | 1.1 |
| InterPro | 142,250,858 | 24.5 | 46.4 | 143,885,580 | 24.8 | 47.7 | 1.0 |
| KEGG | 209,371,499 | 36.1 | 68.3 | 123,130,673 | 21.2 | 40.8 | 1.7 |
| SEED | 102,452,692 | 17.7 | 33.4 | 100,615,086 | 17.3 | 33.4 | 1.0 |
For each of the classifications provided by MEGAN and summarized over all AnnoTree runs and NCBI-nr runs of the DIAMOND+MEGAN pipeline on the 10 data sets listed in Table 1, we report the number of assigned reads (assigned), the percentage of all reads (% of R), and the percentage of all aligned reads (% of Al). In the last column, we report the ratio of the reads assigned using AnnoTree and NCBI-nr, respectively.
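The derived columns follow directly from the read counts. As a minimal sketch, the KEGG row can be recomputed from the totals in Table 1; the function name `assignment_stats` and its parameters are illustrative and are not part of MEGAN.

```python
def assignment_stats(assigned, total_reads, aligned_reads):
    """Return (% of R, % of Al) for one classification in one run."""
    pct_of_reads = 100.0 * assigned / total_reads      # share of all reads
    pct_of_aligned = 100.0 * assigned / aligned_reads  # share of aligned reads
    return pct_of_reads, pct_of_aligned


# Totals over all 10 data sets, copied from the tables above.
TOTAL_READS = 580_158_138
ALIGNED = {"AnnoTree": 306_653_424, "NCBI-nr": 301_541_992}
KEGG_ASSIGNED = {"AnnoTree": 209_371_499, "NCBI-nr": 123_130_673}

kegg = {
    run: assignment_stats(KEGG_ASSIGNED[run], TOTAL_READS, ALIGNED[run])
    for run in ALIGNED
}
# Ratio column: reads assigned in the AnnoTree run divided by the NCBI-nr run.
kegg_ratio = KEGG_ASSIGNED["AnnoTree"] / KEGG_ASSIGNED["NCBI-nr"]
```

Rounded to one decimal place, these values match the KEGG row: 36.1 and 68.3 for the AnnoTree run, 21.2 and 40.8 for the NCBI-nr run, and a ratio of 1.7.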
FIG 2 Details of the assignment of reads to the NCBI taxonomy. For each of the 10 data sets, for the total set of reads assigned by either an AnnoTree run or NCBI-nr run of the DIAMOND+MEGAN pipeline, we show the proportion of reads only assigned by the AnnoTree run (green), assigned by both runs to the same taxon (yellow), or only assigned by the NCBI-nr run (gray). For reads with differing assignments, we show the proportion assigned to incompatible lineages (dotted) or two compatible lineages with either the AnnoTree assignment being more specific (vertical stripes) or the NCBI-nr assignment being more specific (horizontal stripes). On the right, we indicate the total number of reads and the number of reads assigned by either the AnnoTree or NCBI-nr run.
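The categories in this caption can be made concrete with a small sketch. It assumes that each assignment is available as a root-to-taxon lineage and that two lineages are "compatible" when one is a prefix of the other; the caption does not spell out this definition, and the function name `compare_assignments` is hypothetical, not MEGAN's implementation.

```python
def compare_assignments(annotree, ncbi):
    """Classify a read by its two taxonomic assignments.

    Each argument is a root-to-taxon lineage (list of taxon names),
    or None when that run left the read unassigned.
    """
    if annotree is None and ncbi is None:
        return "unassigned"
    if ncbi is None:
        return "only AnnoTree"
    if annotree is None:
        return "only NCBI-nr"
    if annotree == ncbi:
        return "same taxon"
    shorter, longer = sorted((annotree, ncbi), key=len)
    # Assumption: lineages are compatible when one is a prefix of the other.
    if longer[:len(shorter)] != shorter:
        return "incompatible"
    # The longer lineage ends at a deeper, more specific taxon.
    if len(annotree) > len(ncbi):
        return "AnnoTree more specific"
    return "NCBI-nr more specific"


bacteria = ["Bacteria"]
proteobacteria = ["Bacteria", "Proteobacteria"]
firmicutes = ["Bacteria", "Firmicutes"]
```

For example, `compare_assignments(proteobacteria, bacteria)` reports the AnnoTree assignment as more specific, while `compare_assignments(proteobacteria, firmicutes)` reports incompatible lineages.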
FIG 3 Details of the assignment of reads to the GTDB taxonomy, using the same colors as those in Fig. 2.
FIG 4 Details of the assignment of reads to KEGG. For each of the 10 data sets, for the total set of reads assigned by either an AnnoTree run or NCBI-nr run of the DIAMOND+MEGAN pipeline, we show the proportion of reads only assigned by the AnnoTree run (green), assigned by both runs to the same class (yellow) or different classes (olive), or only assigned by the NCBI-nr run (gray).
CPU time used for running DIAMOND, Meganizer, and both combined
| Step | NCBI-nr | AnnoTree | Ratio |
|---|---|---|---|
| DIAMOND | 125,288 min | 61,443 min | 2.0 |
| Meganizer | 2,241 min | 2,404 min | 0.9 |
| DIAMOND+MEGAN | 127,529 min | 63,847 min | 2.0 |
Summarizing all 10 data sets, we show the CPU time used for running DIAMOND, Meganizer, and both combined during either an NCBI-nr run (restricted to prokaryotic sequences) or an AnnoTree run of the DIAMOND+MEGAN pipeline.
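The ratios in this table are consistent with NCBI-nr CPU time divided by AnnoTree CPU time, so values above 1 mean the AnnoTree run was faster. The following sketch recomputes them from the reported minutes; the helper name `speedup` is illustrative only.

```python
# CPU minutes summed over all 10 data sets, copied from the table above.
CPU_MINUTES = {
    "DIAMOND": {"NCBI-nr": 125_288, "AnnoTree": 61_443},
    "Meganizer": {"NCBI-nr": 2_241, "AnnoTree": 2_404},
}


def speedup(step_minutes):
    """NCBI-nr time divided by AnnoTree time for one pipeline step."""
    return step_minutes["NCBI-nr"] / step_minutes["AnnoTree"]


# Combined time per database is the sum over both steps.
combined = {
    db: sum(step[db] for step in CPU_MINUTES.values())
    for db in ("NCBI-nr", "AnnoTree")
}
ratios = {step: speedup(minutes) for step, minutes in CPU_MINUTES.items()}
ratios["DIAMOND+MEGAN"] = speedup(combined)

# Fraction of total CPU time saved by switching to AnnoTree.
saved_fraction = 1 - combined["AnnoTree"] / combined["NCBI-nr"]
```

The overall 2-fold speedup comes almost entirely from the DIAMOND alignment step. Meganizer is slightly slower on the AnnoTree run, but it accounts for only a small share of the total time.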
SRA run ID, sequencing platform, read layout, and total number of reads
| Data set | SRA run ID | Platform | Layout | Total no. of reads |
|---|---|---|---|---|
| River1 | | LS454 | Single | 646,178 |
| River2 | | Illumina | Paired | 129,753,222 |
| Seagrass | | Illumina | Paired | 98,260,754 |
| Skin | | Illumina | Single | 22,827,626 |
| Stool | | Illumina | Paired | 33,214,614 |
| Soil | | Illumina | Paired | 97,595,185 |
| Thermal pools | | Illumina | Paired | 52,908,626 |
| Bioreactor1 | | Illumina | Paired | 99,998,110 |
| Bioreactor2 | | Illumina | Paired | 44,258,996 |
| Bioreactor3 | | ONT | Single | 694,827 |
Number of prokaryotic accessions in NCBI-nr or AnnoTree that map to a class in the classification systems
| Classification | NCBI-nr (no.) | AnnoTree (no.) | Ratio |
|---|---|---|---|
| NCBI taxonomy | 182,329,414 | 106,052,079 | 1.72 |
| GTDB taxonomy | 126,956,422 | 106,052,079 | 1.2 |
| EC | 4,501,593 | 2,962,187 | 1.51 |
| eggNOG | 4,274,800 | 3,506,041 | 1.21 |
| InterPro | 19,748,423 | 11,069,757 | 1.78 |
| KEGG | 8,218,708 | 56,577,432 | 0.15 |
| SEED | 31,117,272 | 16,183,436 | 1.92 |
For two different taxonomic classifications (NCBI and GTDB) and for five different functional classifications (EC, eggNOG, InterPro, KEGG, and SEED) supported by MEGAN, we report the number of prokaryotic accessions in NCBI-nr or AnnoTree that have a mapping to a class in the classification and the corresponding ratio (NCBI-nr count divided by AnnoTree count).
MEGAN ultimate edition.