Igor Sfiligoi, George Armstrong, Antonio Gonzalez, Daniel McDonald, Rob Knight.
Abstract
UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another (beta diversity). Striped UniFrac recently added the ability to split the problem into many independent subproblems, exhibiting nearly linear scaling but suffering from memory contention. Here, we adapt UniFrac to graphics processing units using OpenACC, enabling a greater than 1,000× computational improvement, and apply it to 307,237 samples, the largest 16S rRNA V4 uniformly preprocessed microbiome data set analyzed to date.

IMPORTANCE UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another. Here, we adapt UniFrac to operate on graphics processing units, enabling a 1,000× computational improvement. To highlight this advance, we perform what may be the largest microbiome analysis to date, applying UniFrac to 307,237 16S rRNA V4 microbiome samples preprocessed with Deblur. These scaling improvements turn UniFrac into a real-time tool for common data sets and unlock new research questions as more microbiome data are collected.
Keywords: GPU; OpenACC; UniFrac; microbiome; optimization
Year: 2022 PMID: 35638356 PMCID: PMC9239203 DOI: 10.1128/msystems.00028-22
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 7.324
FIG 1 Optimized UniFrac. (A) Sparsity of microbiome data as a function of the number of samples, stratified by environments with at least 1,000 samples (307k sample set), representing 92.37% of the total number of samples. (B) Runtime of optimized UniFrac using the 307k sample set. Black bars represent CPUs, and white bars represent GPUs. (C) The proportion of the total convex hull volume, computed over principal coordinates, for environments with at least 1,000 samples (307k sample set). Volumes were obtained by randomly selecting 1,000 samples from each environment 10 times and computing the convex hull volume over the first three principal coordinates for those samples, normalized to the total convex hull volume of all 307k samples. (D) A principal coordinates plot of the first two axes from 307k public and anonymized private 16S rRNA V4 samples from Qiita, colored by the Earth Microbiome Project Ontology level 3. An interactive version of the plot can be accessed at https://bit.ly/unifrac-pcoa-307k.
Speedups on the EMP data set relative to a few different architectures for unweighted UniFrac
| Platform | RAM (GB) | Runtime (min) | Speedup | GPU speedup | Mobile speedup |
|---|---|---|---|---|---|
| Original CPU Xeon Gold 6242 | 5.6 | 504 | 1× | | |
| CPU Mobile i7-8565U | 8.1 | 28.2 | 18× | | 1× |
| CPU Mobile i7-8850H | 8.1 | 18.7 | 27× | | 1.5× |
| CPU Xeon Gold 6242 | 8.1 | 4.8 | 105× | 1× | |
| GPU Mobile GTX 1050 Max-Q | 6.6 | 3.8 | 130× | 1.3× | 7.4× |
| GPU T4 | 7.8 | 1.5 | 340× | 3.2× | |
| GPU RTX2080TI | 8.4 | 0.73 | 690× | 6.6× | |
| GPU V100 PCIE 32GB | 8.2 | 0.75 | 670× | 6.4× | |
| GPU A100 PCIE 40GB | 7.8 | 0.62 | 810× | 7.7× | |
| GPU RTX3090 | 8.4 | 0.53 | 950× | 9.0× | |
| GPU RTX8000 | 7.8 | 0.48 | 1,050× | 10.0× | |
Speedup is relative to performance on the same data using Striped UniFrac from McDonald et al. (10). In all cases, all available compute resources for an architecture were utilized. Peak resident memory for the runs is provided.
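
The speedup columns are ratios of runtimes, so they can be rechecked directly from the Runtime column. The sketch below does this for a few rows of the table; the dictionary `runtime_min` and the helper `speedup` are illustrative names and are not part of the UniFrac software.

```python
# Runtimes in minutes, copied from the EMP table above.
runtime_min = {
    "Original CPU Xeon Gold 6242": 504,
    "CPU Mobile i7-8565U": 28.2,
    "CPU Xeon Gold 6242": 4.8,
    "GPU T4": 1.5,
    "GPU RTX8000": 0.48,
}


def speedup(baseline: str, platform: str) -> float:
    """Return how many times faster `platform` ran than `baseline`."""
    return runtime_min[baseline] / runtime_min[platform]


# "Speedup" column: relative to the original Striped UniFrac run.
for name in runtime_min:
    ratio = speedup("Original CPU Xeon Gold 6242", name)
    print(f"{name}: {ratio:.0f}x vs. original")

# "GPU speedup" column: relative to the optimized code on the same Xeon CPU.
gpu_ratio = speedup("CPU Xeon Gold 6242", "GPU RTX8000")
print(f"GPU RTX8000 vs. optimized CPU: {gpu_ratio:.1f}x")
```

The published values are rounded, so small differences from the recomputed ratios are expected.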
Speedups on the 113k data set relative to a few different architectures for unweighted UniFrac
| Platform | RAM (GB) | Runtime (h) | Speedup | GPU speedup | No. of chunks |
|---|---|---|---|---|---|
| Original CPU Xeon Gold 6242 | 5.5 | 498 | 1× | | 36 |
| CPU Mobile i7-8850H | Not collected | 10 | 50× | | 12 |
| CPU Xeon Gold 6242 | 148 | 3 | 166× | 1× | 1 |
| GPU Mobile GTX 1050 Max-Q | 3.6 | 3 | 166× | 1× | 36 |
| GPU T4 | 38 | 0.68 | 730× | 4.4× | 4 |
| GPU RTX2080TI | 27 | 0.32 | 1,560× | 9.4× | 6 |
| GPU V100 PCIE 32GB | 75 | 0.22 | 2,260× | 13.6× | 2 |
| GPU RTX3090 | 51 | 0.19 | 2,600× | 15.8× | 3 |
Speedup is relative to performance on the same data using Striped UniFrac from McDonald et al. (10). In all cases, all available compute resources for an architecture were utilized. Peak resident memory for the runs is provided; however, the maximum amount of memory used for processing is a function of how many chunks are processed at one time. The largest memory use comes from creating the distance matrix, whose size grows as N² with the number of samples N (not shown) and is effectively invariant to the architecture.
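
To make the quadratic growth of the distance matrix concrete, the following back-of-the-envelope sketch estimates its size. It assumes 8-byte (float64) entries and uses 113,000 as a round stand-in for the 113k data set; neither detail is stated in the table, and the actual software may store distances differently (for example, in single precision). The function names are illustrative.

```python
def dense_matrix_gib(n_samples: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of a dense n x n matrix (float64 assumed by default)."""
    return n_samples * n_samples * bytes_per_value / 2**30


def condensed_matrix_gib(n_samples: int, bytes_per_value: int = 8) -> float:
    """Size in GiB when only the n*(n-1)/2 unique sample pairs are stored."""
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs * bytes_per_value / 2**30


# 113_000 is an assumed round figure; 307,237 is the sample count from the abstract.
for n in (113_000, 307_237):
    print(
        f"N={n:,}: dense {dense_matrix_gib(n):,.0f} GiB, "
        f"condensed {condensed_matrix_gib(n):,.0f} GiB"
    )
```

Under these assumptions, a dense matrix takes roughly 95 GiB for 113,000 samples and roughly 700 GiB for 307,237 samples, which is why the matrix, rather than the per-chunk working memory, dominates the footprint.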
Speedups on the 307k data set relative to a few different architectures for unweighted UniFrac
| Platform | RAM (GB) | Runtime (h) | Speedup | GPU speedup | No. of chunks |
|---|---|---|---|---|---|
| Original CPU Xeon Gold 6242 | 7 | 8,326 | 1× | | 482 |
| CPU Xeon Gold 6242 | 184 | 33.1 | 252× | 1× | 6 |
| GPU T4 | 38 | 6.9 | 1,200× | 4.8× | 30 |
| GPU RTX2080TI | 33 | 3.3 | 2,530× | 10× | 36 |
| GPU V100 PCIE 32GB | 85 | 1.93 | 4,300× | 17.2× | 13 |
| GPU RTX3090 | 47 | 1.97 | 4,200× | 16.8× | 24 |
Speedup is relative to performance on the same data using Striped UniFrac from McDonald et al. (10). In all cases, all available compute resources for an architecture were utilized. Peak resident memory for the runs is provided; however, the maximum amount of memory used for processing is a function of how many chunks are processed at one time. The largest memory use comes from creating the distance matrix, whose size grows as N² with the number of samples N (not shown) and is effectively invariant to the architecture.
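
The relationship between the number of chunks and per-chunk work can be illustrated with a small partitioning sketch. It assumes that the independent subproblems are stripes of the distance matrix and that there are (N + 1) // 2 of them for N samples; this record does not describe how the authors actually partition the work, so the stripe count and the helper `chunk_bounds` are hypothetical.

```python
def chunk_bounds(n_items: int, n_chunks: int) -> list:
    """Split range(n_items) into n_chunks contiguous [start, stop) ranges
    whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


n_samples = 307_237
n_stripes = (n_samples + 1) // 2  # assumption: stripe count for N samples

# Chunk counts taken from the 307k table above.
for n_chunks in (6, 13, 36):
    sizes = [stop - start for start, stop in chunk_bounds(n_stripes, n_chunks)]
    print(f"{n_chunks} chunks -> at most {max(sizes):,} stripes per chunk")
```

More chunks mean fewer stripes held in memory at once, which is consistent with the lower peak memory reported for platforms that used more chunks.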