Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham.
Abstract
MOTIVATION: The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers is the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.
Year: 2021 PMID: 34498030 PMCID: PMC8696089 DOI: 10.1093/bioinformatics/btab645
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
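The log-ratio biomarkers the abstract refers to have a simple concrete form. The sketch below is ours, not the paper's code: the `balance` helper, the toy data, and the index sets are illustrative, and the normalizing constant carried by a formal isometric-log-ratio balance is omitted. It scores each sample by the log-ratio between the geometric means of two disjoint subsets of parts:

```python
import numpy as np

def balance(x, num_idx, den_idx, eps=1e-6):
    """Log-ratio ('balance') between the geometric means of two disjoint
    subsets of the parts of a composition x (last axis sums to 1).
    eps guards against the zeros that are common in HTS counts."""
    log_x = np.log(x + eps)
    return log_x[..., num_idx].mean(axis=-1) - log_x[..., den_idx].mean(axis=-1)

# Toy data: 5 samples x 4 taxa of relative abundances.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=50, size=(5, 4)).astype(float)
x = counts / counts.sum(axis=1, keepdims=True)

# A hypothetical biomarker: taxa {0, 2} against taxon {3}.
z = balance(x, num_idx=[0, 2], den_idx=[3])
print(z)  # one scalar score per sample, usable as a regression input
```

The combinatorial difficulty mentioned in the abstract lies in choosing `num_idx` and `den_idx`: each of the p input variables can go in the numerator, the denominator, or neither, giving on the order of 3^p candidate log-ratios.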
Table 1. Evaluation metrics for each method, averaged over 25 datasets × 20 random train/test splits
| Method | Runtime (s) | Active inputs (%) | Accuracy (%) | AUC (%) | F1 (%) |
|---|---|---|---|---|---|
| **CoDaCoRe—Balances (ours)** | | | 75.2 ± 2.4 | 79.5 ± 2.6 | 73.7 ± 2.6 |
| **CoDaCoRe—Amalgamations (ours)** | | 1.9 ± 0.3 | 71.8 ± 2.4 | 74.5 ± 2.8 | 69.8 ± 2.9 |
| selbal | 79 033.7 ± 2094.1 | 2.4 ± 0.2 | 61.2 ± 1.9 | 80.0 ± 2.4 | 70.9 ± 1.1 |
| Pairwise log-ratios | 14 207.0 ± 1038.4 | 2.5 ± 0.4 | 73.3 ± 1.7 | 75.2 ± 2.4 | 67.8 ± 3.0 |
| Lasso | | 4.4 ± 0.6 | 72.4 ± 1.7 | 75.2 ± 2.3 | 65.2 ± 3.7 |
| CoDaCoRe—balances with … | | 6.1 ± 0.7 | | | |
| Coda-lasso | 1043.0 ± 55.4 | 19.7 ± 2.7 | 72.5 ± 2.3 | 78.0 ± 2.4 | 64.2 ± 4.4 |
| amalgam | 7360.5 ± 209.8 | 87.6 ± 2.1 | 74.4 ± 2.5 | 78.2 ± 2.7 | 73.9 ± 2.8 |
| DeepCoDA | 296.5 ± 21.4 | 89.3 ± 0.6 | 70.6 ± 2.9 | 77.6 ± 2.9 | 64.7 ± 7.4 |
| CLR-lasso | | 100.0 ± 0.0 | 77.5 ± 1.8 | 81.6 ± 2.2 | 75.8 ± 2.7 |
| Random Forest | 10.6 ± 0.4 | – | 78.0 ± 2.2 | 82.2 ± 2.2 | 77.3 ± 2.5 |
| Log-ratio lasso* | 135.0 ± 11.1 | 0.7 ± 0.0 | 72.0 ± 2.4 | 76.4 ± 2.3 | 69.2 ± 2.7 |
Note: Standard errors are computed independently on each dataset and then averaged over the 25 datasets. The models are ordered by sparsity, i.e. the percentage of active input variables. CoDaCoRe (with balances) is the only learning algorithm that is simultaneously fast, sparse and accurate. The penultimate row shows the performance of Random Forest, a powerful black-box classifier that can be thought of as providing an approximate upper bound on the predictive accuracy of any interpretable model. The bottom row is shown separately and marked with an asterisk because the corresponding algorithm failed to converge on 432 out of our 500 runs (averages were taken after imputing these missing values with the corresponding values obtained with pairwise log-ratios, the most similar method). We highlight in bold the CoDa models that are fast to run, as well as the CoDa models that are most sparse and accurate.
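For concreteness, here is a minimal numpy sketch of one plausible reading of this aggregation scheme (the `summarize` helper and the synthetic scores are ours, purely illustrative): standard errors are taken within each dataset over its 20 splits, and means and standard errors are then averaged over the 25 datasets.

```python
import numpy as np

def summarize(acc):
    """acc has shape (n_datasets, n_splits), one score per train/test split.
    Per the table's note: standard errors are computed independently on
    each dataset, then averaged over datasets (as are the means)."""
    se_per_dataset = acc.std(axis=1, ddof=1) / np.sqrt(acc.shape[1])
    return acc.mean(axis=1).mean(), se_per_dataset.mean()

rng = np.random.default_rng(1)
acc = rng.normal(loc=0.75, scale=0.03, size=(25, 20))  # synthetic scores
mean, se = summarize(acc)
print(f"{100 * mean:.1f} ± {100 * se:.1f}")  # e.g. '75.0 ± 0.7'
```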
Fig. 1. Gain in classification accuracy (relative to the “majority vote” baseline classifier) plotted against runtime. Each point represents one of the 25 datasets, with size proportional to the input dimension. Note that the x-axis is drawn on a log scale. CoDaCoRe (with balances) is the only method that scales effectively to our larger datasets, while consistently achieving high predictive accuracy. Moreover, its performance is broadly consistent across smaller and larger datasets.
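The figure's baseline is easy to pin down: a majority-vote classifier always predicts the most frequent class, so its accuracy equals the majority-class frequency. A small sketch of the gain computation (ours, for illustration only):

```python
import numpy as np

def accuracy_gain(y_true, y_pred):
    """Gain over the majority-vote baseline, whose accuracy equals the
    frequency of the most common class in y_true."""
    baseline = np.bincount(y_true).max() / len(y_true)
    return (y_pred == y_true).mean() - baseline

y_true = np.array([0, 0, 0, 1, 1])    # majority-class frequency: 0.6
y_pred = np.array([0, 0, 1, 1, 1])    # classifier accuracy: 0.8
print(accuracy_gain(y_true, y_pred))  # ~0.2
```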
Table 2. Evaluation metrics for the liquid biopsy data (Best et al., 2015), averaged over 20 independent 80/20 train/test splits
| Method | Runtime (s) | Vars (#) | Acc. (%) | AUC (%) | F1 (%) |
|---|---|---|---|---|---|
| **CoDaCoRe** | 31 ± 2.2 | | | 93.6 ± 2.6 | |
| Lasso | 23 ± 0.2 | 22 ± 4 | 87.8 ± 1.3 | 94.7 ± 1.5 | 92.7 ± 0.7 |
| RF | 383 ± 8.6 | – | 89.0 ± 1.6 | 94.1 ± 1.8 | 93.1 ± 1.0 |
| XGBoost | 108 ± 1.6 | – | 90.6 ± 1.9 | 94.1 ± 1.1 | |
Note: CoDaCoRe (with balances) achieves predictive accuracy comparable to the competing methods, but with much sparser solutions. Note that sparsity is expressed as an (integer) number of active variables in the model (not as a percentage of the total, as in Table 1). We highlight in bold the sparsest and most accurate models.
Fig. 2. CoDaCoRe variable selection for the first (most explanatory) log-ratio on the Crohn disease data (Rivera-Pinto et al., 2018). For each of 10 independent bootstraps of the training set (80% of the data, randomly sampled with stratification by case–control status), we show which variables are selected in the numerator (blue) and denominator (orange) of the balance. CoDaCoRe learns remarkably consistent log-ratios across independent training sets.
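The resampling scheme behind this figure can be sketched as follows. The sketch is ours: `first_logratio_stub` is a hypothetical stand-in built on a crude clr-correlation heuristic, not CoDaCoRe's actual continuous-relaxation search, and the real package exposes its own fitting API.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def first_logratio_stub(X, y, k=3, eps=1e-6):
    """Hypothetical stand-in for the first learned log-ratio: correlate
    each clr-transformed part with the label, put the k most positively
    correlated parts in the numerator and the k most negative in the
    denominator. Not the paper's algorithm."""
    log_x = np.log(X + eps)
    clr = log_x - log_x.mean(axis=1, keepdims=True)
    corr = np.array([np.corrcoef(clr[:, j], y)[0, 1] for j in range(X.shape[1])])
    order = np.argsort(corr)
    return order[-k:], order[:k]  # numerator indices, denominator indices

def selection_frequencies(X, y, n_boot=10, seed=0):
    """Fig. 2's scheme: refit on 10 independent stratified 80% subsamples
    and tally how often each variable is selected on each side."""
    splitter = StratifiedShuffleSplit(n_splits=n_boot, train_size=0.8,
                                      random_state=seed)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for train_idx, _ in splitter.split(X, y):
        ni, di = first_logratio_stub(X[train_idx], y[train_idx])
        num[ni] += 1
        den[di] += 1
    return num / n_boot, den / n_boot

# Toy run on synthetic compositions with a binary outcome.
rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(12), size=100)
y = rng.integers(0, 2, size=100)
num_freq, den_freq = selection_frequencies(X, y)
print(num_freq)  # fraction of bootstraps selecting each part as numerator
```

Variables whose selection frequency sits near 1.0 across bootstraps are the “remarkably consistent” choices the caption describes.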