Nidhi Shah, Jacquelyn S Meisel, Mihai Pop.
Abstract
The advent of high-throughput sequencing has enabled in-depth characterization of human and environmental microbiomes. Determining the taxonomic origin of microbial sequences is one of the first, and frequently the only, analyses performed on microbiome samples. Substantial research has focused on the development of methods for taxonomic annotation, often making trade-offs between computational efficiency and classification accuracy. A side effect of these efforts has been a reexamination of the bacterial taxonomy itself. Taxonomies developed prior to the genomic revolution captured complex relationships between organisms that went beyond uniform taxonomic levels such as species, genus, and family. Driven in part by the need to simplify computational workflows, the bacterial taxonomies used most commonly today have been regularized to fit within a standard seven taxonomic levels. Consequently, modern analyses of microbial communities are relatively coarse-grained. Few methods make classifications below the genus level, which limits our ability to capture biologically relevant signals. Here, we present ATLAS, a novel strategy for taxonomic annotation that uses significant outliers within database search results to group sequences in the database into partitions. These partitions capture the extent of taxonomic ambiguity within the classification of a sample. The ATLAS pipeline can be found on GitHub [https://github.com/shahnidhi/outlier_in_BLAST_hits]. We demonstrate that ATLAS provides similar annotations to phylogenetic placement methods, but with higher computational efficiency. When applied to human microbiome data, ATLAS is able to identify previously characterized taxonomic groupings, such as those in the class Clostridia and the genus Bacillus. Furthermore, the majority of partitions identified by ATLAS are at the subgenus level, replacing higher-level annotations with specific groups of species. These more precise partitions improve our detection power in determining differential abundance in microbiome association studies.
Keywords: 16S rRNA marker gene; classification; high-throughput sequencing; microbiome; taxonomy
Year: 2019 PMID: 31681437 PMCID: PMC6811648 DOI: 10.3389/fgene.2019.01022
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Schematic diagram of the ATLAS pipeline. ATLAS takes in query sequences from a marker gene and searches them against a reference database to identify outlier sequences. It then constructs a graph of database sequences and clusters those that are commonly identified together into partitions.
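The graph step in this caption can be illustrated with a minimal Python sketch. It assumes each query has already been reduced to the set of database sequences flagged as outliers in its search results, links two database sequences when they appear together for at least `min_weight` queries, and takes connected components as partitions. The function names, the `min_weight` parameter, the toy data, and the use of connected components are illustrative assumptions; the caption does not specify the clustering algorithm used by the pipeline.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(outlier_sets, min_weight=1):
    """Link database sequences that co-occur in the outlier sets of queries.

    outlier_sets maps a query ID to the database sequence IDs identified
    for it. The edge weight is the number of queries containing both IDs.
    """
    weights = defaultdict(int)
    nodes = set()
    for hits in outlier_sets.values():
        unique_hits = sorted(set(hits))
        nodes.update(unique_hits)
        for a, b in combinations(unique_hits, 2):
            weights[(a, b)] += 1

    adjacency = {node: set() for node in nodes}
    for (a, b), weight in weights.items():
        if weight >= min_weight:  # hypothetical threshold, not from the paper
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Return each connected component as a frozenset (a stand-in partition)."""
    seen = set()
    partitions = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        partitions.append(frozenset(component))
    return partitions


# Toy input: four queries and the reference sequences identified for each.
outlier_sets = {
    "q1": ["refA", "refB"],
    "q2": ["refB", "refC"],
    "q3": ["refD"],
    "q4": ["refD", "refE"],
}
partitions = connected_components(build_cooccurrence_graph(outlier_sets))
```

In this toy input, `refA`, `refB`, and `refC` end up in one partition because `refB` is confused with both of the others, while `refD` and `refE` form a second partition.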
Figure 2. Schematic detailing when ATLAS will provide the greatest improvement to taxonomic annotation. Shown is a simple example of a phylogenetic tree with taxonomic information for reference sequences, where the leaves are actual sequences in the database. When a query sequence (yellow stars), such as Q1, has near neighbors in the reference, most algorithms will be able to classify the sequence correctly. However, if a sequence, such as Q2, does not have many near neighbors in the database, computationally expensive phylogenetic methods are required for accurate placement (blue arrows) and annotation. ATLAS captures groups (or partitions) of database sequences (red nodes) that are commonly confused during the annotation process and assigns them to the query sequence (square node for Q1 and diamond nodes for Q2). Black triangles show collapsed portions of the tree. Although this schematic is simplified and real phylogenies are much more complex, it illustrates that ATLAS provides additional information when query sequences do not have near neighbors in the database. It represents the ideal case, in which the 16S rRNA phylogeny and the taxonomic annotations are congruent.
Figure 3. ATLAS generates classifications similar to phylogenetic placement methods at an improved speed. Taxonomic labels assigned by TIPP and ATLAS agree at all taxonomic levels for both (A) GEMS and (B) HMP datasets. (C) The ATLAS pipeline adds minimal post-processing time (in seconds) to standard BLAST analyses, but significantly outperforms TIPP.
Comparison between our approach (ATLAS) and a phylogenetic method (TIPP) for species-level assignments. For most query sequences, the partition assigned by ATLAS contains a group of species, as species-level resolution is often impossible to obtain. Here, we compare how ATLAS performs when TIPP provides a species-level classification.
| | GEMS | HMP |
|---|---|---|
| Number of query sequences classified by TIPP at the species level | 13,050 | 10,086 |
| Number of query sequences assigned to a partition that contained TIPP’s species | 12,847 | 8,999 |
| Number of query sequences classified at species level by ATLAS that match TIPP’s labeling | 29 | 128 |
| Number of query sequences classified at species level by ATLAS that did not match TIPP’s labeling | 0 | 85 |
| Number of query sequences classified at species level by ATLAS but not by TIPP | 18 | 36 |
(A) For query sequences where ATLAS partitions do not have a species-level MRCA, the assigned partition contains reference sequences that match TIPP’s assigned species. (B) For query sequences where ATLAS partitions do have a species-level MRCA, many of the assigned partitions match TIPP’s classification.
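The counts in the table above can be turned into rates with a few lines of Python. This is a worked-arithmetic sketch, assuming that the second row is a subset of the first (queries with a TIPP species-level label) and that the "match" and "did not match" rows together cover every query labeled at the species level by both methods. The dictionary keys and function names are illustrative.

```python
# Counts copied from the ATLAS vs. TIPP comparison table.
TABLE = {
    "GEMS": {
        "tipp_species": 13050,
        "partition_contains_tipp_species": 12847,
        "atlas_species_match": 29,
        "atlas_species_mismatch": 0,
    },
    "HMP": {
        "tipp_species": 10086,
        "partition_contains_tipp_species": 8999,
        "atlas_species_match": 128,
        "atlas_species_mismatch": 85,
    },
}


def containment_rate(row):
    """Fraction of TIPP species-level queries whose partition holds that species."""
    return row["partition_contains_tipp_species"] / row["tipp_species"]


def species_agreement(row):
    """Fraction of species-level calls made by both methods that agree."""
    called = row["atlas_species_match"] + row["atlas_species_mismatch"]
    return row["atlas_species_match"] / called if called else None


for dataset, row in TABLE.items():
    print(dataset, f"{containment_rate(row):.1%}", f"{species_agreement(row):.1%}")
```

Under these assumptions, the partition assigned by ATLAS contains the TIPP species for roughly 98% of GEMS queries and 89% of HMP queries.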
Number of OTUs and partitions in the HMP and GEMS datasets before and after filtering.
| | HMP: OTU | HMP: Partition | GEMS: OTU | GEMS: Genus | GEMS: Partition |
|---|---|---|---|---|---|
| Sequencing platform and region | Illumina V1-V3 | | 454 V1-V2 | | |
| Number of samples | 2,711 | | 992 | | |
| Pre-filtering | 43,140 OTUs | 307 partitions and | 26,044 OTUs | 172 genera | 122 partitions and |
| Post-filtering | 36,560 OTUs | 257 partitions and | 10,774 OTUs | 149 genera | 112 partitions and |
Samples with >1,000 reads were retained for analysis. In the HMP data, features were retained if they had at least 20 total reads or were found in at least 5 samples. In the GEMS data, features were retained if they had at least 20 total reads or were found in at least 10% of case or control samples.
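The retention rules in this note can be written as two small predicates. The sketch below assumes that "found in" a sample means a read count greater than zero, and that the GEMS rule is satisfied when the feature is present in at least 10% of case samples or at least 10% of control samples. Function and parameter names are illustrative.

```python
def keep_feature_hmp(counts, min_reads=20, min_samples=5):
    """HMP rule: at least 20 total reads, or present in at least 5 samples."""
    total_reads = sum(counts)
    samples_present = sum(1 for c in counts if c > 0)
    return total_reads >= min_reads or samples_present >= min_samples


def fraction_present(counts):
    """Fraction of samples in which the feature has a non-zero count."""
    if not counts:
        return 0.0
    return sum(1 for c in counts if c > 0) / len(counts)


def keep_feature_gems(case_counts, control_counts, min_reads=20, min_fraction=0.10):
    """GEMS rule: at least 20 total reads, or present in at least 10% of
    case samples or of control samples (assumed reading of the note)."""
    total_reads = sum(case_counts) + sum(control_counts)
    return (
        total_reads >= min_reads
        or fraction_present(case_counts) >= min_fraction
        or fraction_present(control_counts) >= min_fraction
    )


# A rare but widespread feature passes on prevalence alone.
print(keep_feature_hmp([1, 1, 1, 1, 1, 0, 0]))        # 5 samples, 5 reads
# A feature with few reads in a single sample is dropped.
print(keep_feature_hmp([3, 0, 0, 0, 0, 0, 0]))
```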
Figure 4. ATLAS partitions for HMP and GEMS data typically capture subgenus-level information. Most partitions have their most recent common ancestor at the genus level for both (A) HMP and (B) GEMS datasets.
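The most recent common ancestor (MRCA) of a partition can be read off the taxonomic lineages of its member sequences. The following sketch assumes each reference sequence carries a lineage ordered from the highest to the lowest of the seven standard ranks; the rank names, the function name, and the example lineages are illustrative and are not taken from the ATLAS code.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def partition_mrca(lineages):
    """Return (rank, name) of the deepest rank shared by every lineage.

    Each lineage is a sequence of names ordered like RANKS. Returns None
    if the lineages disagree at the first rank or the input is empty.
    """
    mrca = None
    # zip(*lineages) yields, rank by rank, the names across all members.
    for rank, names in zip(RANKS, zip(*lineages)):
        if len(set(names)) != 1:
            break
        mrca = (rank, names[0])
    return mrca


# Illustrative partition holding two species from the same genus.
bacillus_partition = [
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Bacillaceae",
     "Bacillus", "Bacillus subtilis"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Bacillaceae",
     "Bacillus", "Bacillus cereus"],
]
print(partition_mrca(bacillus_partition))
```

A partition like this one has a genus-level MRCA yet names only two species, which is the sense in which a partition carries information below the genus level.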
Number of OTUs, genera, and ATLAS partitions that are statistically significantly different between moderate-to-severe diarrheal cases and healthy controls.
| | OTU | Genus | Partition |
|---|---|---|---|
| | 679 OTUs | 16 genera | 13 partitions and |
| | 1,112 OTUs | 22 genera | 17 partitions and |
| | 8,983 OTUs | 105 genera | 77 partitions and |
Features generated from 3,501,840 GEMS dataset sequences were considered differentially abundant if they had a fold change or odds ratio exceeding 2 in either cases or controls and the statistical association was significant (P < 0.05) after Benjamini-Hochberg correction for multiple testing. Singleton partitions have a single OTU mapped to them. Note that when aggregating at the genus level, 2,411 OTUs and 899,322 sequences had no assignment.
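The decision rule in this note combines an effect-size threshold with a multiple-testing correction. The sketch below implements the standard Benjamini-Hochberg adjustment in plain Python and applies the two criteria. It assumes the effect size is expressed as a case-to-control ratio, so "exceeding 2 in either cases or controls" is read as a ratio above 2 or below 1/2. The function names and the toy values are illustrative, and the statistical test that produces the raw P values is not modeled here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def is_differential(ratio, adjusted_p, threshold=2.0, alpha=0.05):
    """Apply both criteria: ratio beyond 2-fold in either direction
    (assumed case/control ratio) and adjusted P below alpha."""
    large_effect = ratio > threshold or 0 < ratio < 1 / threshold
    return large_effect and adjusted_p < alpha


# Toy features: (case/control ratio, raw P value).
ratios = [3.0, 1.5, 0.4, 2.5]
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
calls = [is_differential(r, p) for r, p in zip(ratios, adjusted_p)]
```

The second toy feature is rejected despite a small adjusted P value because its ratio stays within the 2-fold band, which shows that the two criteria act independently.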
Confusion matrix highlighting the number of shared/unshared statistically significant OTUs and ATLAS partitions.
| ATLAS partitions | OTUs: Not Significant | OTUs: Significant |
|---|---|---|
| | 4,557 | 608 |
| | 4,426 | 1,183 |
Features were considered differentially abundant between healthy controls and diarrheal cases if they had a fold change or odds ratio exceeding 2 in either cases or controls and the statistical association was significant (P < 0.05) after Benjamini-Hochberg correction for multiple testing.
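As a consistency check, the column totals of the confusion matrix can be compared with the OTU counts reported elsewhere in this record. The sketch assumes the two columns correspond to non-significant and significant OTUs, as the header indicates; the row labels are not recoverable here, so only column sums are used.

```python
# Rows as given in the confusion matrix; columns are
# (OTU not significant, OTU significant).
confusion = [
    [4557, 608],
    [4426, 1183],
]

otu_not_significant = sum(row[0] for row in confusion)
otu_significant = sum(row[1] for row in confusion)
total_otus = otu_not_significant + otu_significant

print(otu_not_significant, otu_significant, total_otus)
```

The significant column sums to 1,791, which equals 679 + 1,112 from the differential abundance table; the non-significant column sums to 8,983, the third OTU count in that table; and the overall total of 10,774 equals the post-filtering GEMS OTU count.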