George Armstrong, Kalen Cantrell, Shi Huang, Daniel McDonald, Niina Haiminen, Anna Paola Carrieri, Qiyun Zhu, Antonio Gonzalez, Imran McGrath, Kristen L Beck, Daniel Hakim, Aki S Havulinna, Guillaume Méric, Teemu Niiranen, Leo Lahti, Veikko Salomaa, Mohit Jain, Michael Inouye, Austin D Swafford, Ho-Cheol Kim, Laxmi Parida, Yoshiki Vázquez-Baeza, Rob Knight.
Abstract
The number of publicly available microbiome samples is continually growing. As data set size increases, bottlenecks arise in standard analytical pipelines. Faith's phylogenetic diversity (Faith's PD) is a highly utilized phylogenetic alpha diversity metric that has thus far failed to effectively scale to trees with millions of vertices. Stacked Faith's phylogenetic diversity (SFPhD) enables calculation of this widely adopted diversity metric at a much larger scale by implementing a computationally efficient algorithm. The algorithm reduces the amount of computational resources required, resulting in more accessible software with a reduced carbon footprint, as compared to previous approaches. The new algorithm produces identical results to the previous method. We further demonstrate that the phylogenetic aspect of Faith's PD provides increased power in detecting diversity differences between younger and older populations in the FINRISK study's metagenomic data.
Year: 2021 PMID: 34479875 PMCID: PMC8559715 DOI: 10.1101/gr.275777.121
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Partially aggregating branch lengths reduces the space complexity of the algorithm. (A) Faith's PD calculation depends on the representation of features present in samples. In the table, the letters (R, O, B, K) represent samples and the numbers (0, 1, 2, 4, 6, 9, 10) represent features. A “1” in an entry indicates the presence of a feature in the sample. SFPhD uses sparse table data structures, which reduce memory by only keeping track of the nonzero values in a matrix (highlighted in gray). (B) A mock reference phylogenetic tree is shown, with the features from A as tips. Labels for the samples from A are located next to the tips that they contain. The nodes are labeled by their order in a postorder traversal of the tree. (C) Graphic depiction of the reference implementation's calculation of Faith's PD by first aggregating the presence/absence information for each branch in the tree, followed by multiplication by the branch lengths to get the metric constituents, and finally a sum over the entire branch × metric constituent table. (D) Graphic representation of the execution of SFPhD. On the left, the stack of presence/absence information is shown at three points during the algorithm's execution (i, ii, iii). Each of these times shows the stack immediately before memory is freed. On the right, the state of the partially aggregated phylogenetic diversity (PD) is shown after each node is added to the stack. Each row represents the vector after a step in the algorithm. In practice, there is only one such vector. (E) The balanced parentheses representation for the phylogenetic tree from B.
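The stack-based idea in panel D can be illustrated with a minimal sketch. This is not the authors' SFPhD implementation: the function name stacked_faith_pd, the (name, branch length, number of children) node encoding, and the three-tip mock tree are all assumptions made for illustration, and dense Python lists stand in for the sparse table. Nodes are visited in postorder; each node's presence/absence vector is built from its children's vectors, which are popped off the stack (freeing them), and the node's branch length is added to a single running PD vector for every sample in which the node is present.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical node encoding: (name, branch_length, number_of_children),
# listed in postorder so children always appear before their parent.
Node = Tuple[str, float, int]


def stacked_faith_pd(
    postorder: List[Node],
    tip_samples: Dict[str, Set[str]],
    samples: List[str],
) -> Dict[str, float]:
    """Compute Faith's PD per sample with a stack of presence vectors."""
    index = {s: i for i, s in enumerate(samples)}
    pd = [0.0] * len(samples)          # the single partially aggregated PD vector
    stack: List[List[bool]] = []       # presence/absence vectors awaiting a parent

    for name, length, n_children in postorder:
        present = [False] * len(samples)
        if n_children == 0:
            # Tip: present in exactly the samples that contain this feature.
            for s in tip_samples.get(name, ()):
                present[index[s]] = True
        else:
            # Internal node: OR together the children's vectors, popping
            # them so their memory can be released immediately.
            for _ in range(n_children):
                child = stack.pop()
                present = [a or b for a, b in zip(present, child)]
        # Add this branch's length to every sample that passes through it.
        for i, flag in enumerate(present):
            if flag:
                pd[i] += length
        stack.append(present)

    return dict(zip(samples, pd))


# Mock tree (an assumption, not the tree in the figure):
# ((a:1,b:2)x:3,c:4)root:0
mock_postorder: List[Node] = [
    ("a", 1.0, 0),
    ("b", 2.0, 0),
    ("x", 3.0, 2),
    ("c", 4.0, 0),
    ("root", 0.0, 2),
]
mock_tip_samples = {"a": {"R", "O"}, "b": {"O"}, "c": {"B"}}
example_pd = stacked_faith_pd(mock_postorder, mock_tip_samples, ["R", "O", "B"])
# example_pd == {"R": 4.0, "O": 6.0, "B": 4.0}
```

In this sketch the stack never holds more vectors than there are unmerged subtrees along the current traversal, rather than one vector per branch as in the full branch-by-sample table of panel C. The root branch is given length 0 here; whether a root branch contributes to the metric is a convention choice and is treated as an assumption.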
Figure 2. SFPhD outperforms the reference implementation in terms of runtime and memory usage. (A) Runtime in seconds for computing Faith's PD on data sets with thousands of samples and 100,000 tips in the phylogeny. Data are independently subsampled from a collection of 113,721 public samples in Qiita (Gonzalez et al. 2018; Zhu et al. 2019) as previously processed (McDonald et al. 2018b). Mean of n = 10 repetitions with 95% CI error bars. (B) Memory usage for the same experiment as in A. For both A and B, jobs were terminated if they exceeded 250 GB of memory.
Figure 3. Phylogenetic diversity provides increased statistical power to differentiate age groups in shotgun metagenomics but not in 16S rRNA sequencing. (A) Statistical power to differentiate young adults from old adults in two alpha diversity metrics at different sample sizes using 16S rRNA sequencing in the FINRISK cohort. (B) Same as A but for shallow shotgun metagenomic sequencing.
Figure 4. Phylogenetic tree colored by the age-group log of the likelihood ratio of older to younger adults per node. (A) Distribution of Faith's PD by age group on the full data set. (B) Web of Life (WoL) phylogenetic tree with branches colored by the log of the likelihood ratio of older adults compared to younger adults in descendants of the branch, for the FINRISK data set. The inner circle is colored by the log of the likelihood ratio of older adults compared to younger adults in the tips of the tree. The outer circle is colored by the phylum of the taxon represented by each tree tip. Red ellipses mark two clades enriched for samples from older individuals.