Josip Rudar, Teresita M Porter, Michael Wright, G Brian Golding, Mehrdad Hajibabaei.
Abstract
BACKGROUND: Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing (HTS) datasets is difficult. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in HTS studies due to their robustness when the number of features in the training data is orders of magnitude larger than the number of samples. In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery.
Keywords: Biomarker selection; Biomonitoring; Ecological assessment; Machine learning; Metabarcoding; Metagenomics
Year: 2022 PMID: 35361114 PMCID: PMC8969335 DOI: 10.1186/s12859-022-04631-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A geometric overview of how LANDMark trees can partition samples. Oblique (straight line) and non-linear (curved line) splits are created by linear and neural network models, respectively. Unlike the axis-aligned splits used in Random Forests, LANDMark nodes consider multiple features. This allows each model to take advantage of the additional information and learn more appropriate decision rules. Since multiple models are considered at each node, only those which partition samples into smaller, purer regions are selected. Random linear oracles (left) can be used to add additional randomness to LANDMark. This approach selects two points at random without replacement and calculates the midpoint between these samples. The midpoint is then used to find the hypersurface orthogonal to the line connecting the initial two points. Samples are then partitioned according to the side of the hypersurface on which they fall. Following this, different randomized subsets of features and/or bootstrapped samples of the data are used to train different supervised models in each node (middle). This process is repeated until a stopping criterion is met (right). Many trees are constructed in this way and their decisions are combined to produce the final prediction
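The random linear oracle described in this caption can be sketched in a few lines. This is an illustrative reading of the caption, not LANDMark's implementation: two samples are drawn at random, and the remaining points are assigned to one side of the hyperplane that passes through their midpoint and is orthogonal to the segment joining them.

```python
import numpy as np

def random_oracle_split(X, rng):
    """Partition rows of X with a random linear oracle: pick two anchor
    samples, then split on the hyperplane through their midpoint that is
    orthogonal to the vector joining them."""
    i, j = rng.choice(X.shape[0], size=2, replace=False)
    normal = X[i] - X[j]                # normal vector of the hyperplane
    midpoint = (X[i] + X[j]) / 2.0      # the hyperplane passes through here
    return (X - midpoint) @ normal > 0  # sign of the projection picks the side

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
mask = random_oracle_split(X, rng)
left, right = X[mask], X[~mask]
```

Because the two anchor points always fall on opposite sides of the hyperplane, neither partition can be empty.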
Fig. 2 A simplified overview of how LANDMark constructs and selects splitting rules at each node. Multiple models are evaluated at each node in each tree. Diversity between LANDMark trees is due to the random selection of training samples and features in each node, random initializations of most models, and the random selection of models which create partitions with equivalent information gain
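As a concrete illustration of the selection criterion in this caption, the sketch below scores a candidate partition by its information gain (the entropy of the parent node minus the weighted entropy of the children). The toy labels and tie handling are assumptions, not the paper's code.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    """Entropy of the parent minus the weighted entropy of the two
    children produced by the boolean partition `mask`."""
    n = len(y)
    children = (mask.sum() / n) * entropy(y[mask]) \
        + ((~mask).sum() / n) * entropy(y[~mask])
    return entropy(y) - children

y = np.array([0, 0, 0, 1, 1, 1])
perfect = np.array([True, True, True, False, False, False])  # pure children
useless = np.array([True, False, True, False, True, False])  # mixed children
gain_perfect = information_gain(y, perfect)  # 1.0 bit
gain_useless = information_gain(y, useless)  # much smaller
```

A node would keep only candidate models whose partitions score near the top of this criterion, breaking ties at random as the caption describes.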
Default models and the possible parameter choices for each model under different conditions
| Classifier | Non-tunable parameters | Number of training features (≥ 4 features) | Parameters (If number of samples > 6) | Parameters (if number of samples ≤ 6) |
|---|---|---|---|---|
| Logistic regression (LBFGS solver) | max_iter = 2000 penalty = “l2” | Randomly selected—user defined | A grid search using fivefold stratified cross-validation is used to choose C from logarithmically spaced values in the range of 10⁻⁴ to 10⁴ | C parameter set to 1.0 |
| Logistic regression (Liblinear solver) | max_iter = 2000 penalty = “l1” | Full | A grid search using fivefold stratified cross-validation is used to choose C from logarithmically spaced values in the range of 10⁻⁴ to 10⁴ | C parameter set to 1.0 |
| Linear SVC | max_iter = 2000 | Randomly selected—user defined | A grid search using fivefold stratified cross-validation is used to choose alpha (for SGD classifiers) or C (for the linear SVC). The possible choices for these parameters are 0.001, 0.01, 0.1, 1.0, 10, 100. In the case of the SGD Classifier, the loss function (hinge or modified Huber) is also chosen using fivefold cross-validation | C parameter set to 1.0 |
| Stochastic gradient descent classifier (L2 penalty) | max_iter = 2000 | Randomly selected—user defined | A grid search using fivefold stratified cross-validation is used to choose alpha (for SGD classifiers) or C (for the linear SVC). The possible choices for these parameters are 0.001, 0.01, 0.1, 1.0, 10, 100. In the case of the SGD Classifier, the loss function (hinge or modified Huber) is also chosen using fivefold cross-validation | Alpha parameter set to 1.0, loss function (hinge or modified Huber) is randomly chosen |
| Stochastic gradient descent classifier (elastic-net penalty) | max_iter = 2000 | Full | A grid search using fivefold stratified cross-validation is used to choose alpha (for SGD classifiers) or C (for the linear SVC). The possible choices for these parameters are 0.001, 0.01, 0.1, 1.0, 10, 100. In the case of the SGD Classifier, the loss function (hinge or modified Huber) is also chosen using fivefold cross-validation | Alpha parameter set to 1.0, loss function (hinge or modified Huber) is randomly chosen |
| Ridge regression | NA | Randomly selected—user defined | Alpha chosen from logarithmically spaced values in the range of 10⁻³ to 10⁴ using generalized cross validation | NA |
| Neural network | batch_size = 32 epochs = 300 validation_split = 0.10 min_delta = 0.0001 patience = 40 See text for architecture details | Randomly selected—user defined | NA | NA |
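The hyperparameter search described in the table can be reproduced with scikit-learn. This is a sketch of the stated protocol for nodes holding more than six samples; the number of grid points (nine here) and the toy dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Toy stand-in for the samples reaching a node.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Fivefold stratified CV over logarithmically spaced C values in [1e-4, 1e4].
search = GridSearchCV(
    LogisticRegression(max_iter=2000, penalty="l2"),
    {"C": np.logspace(-4, 4, 9)},
    cv=StratifiedKFold(n_splits=5),
)
search.fit(X, y)
best_c = search.best_params_["C"]
```

Per the table, nodes with six or fewer samples skip the search and keep the default C = 1.0 instead.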
Overview of the performance of each model in each of the synthetic tests
| | LANDMark (oracle) | LANDMark (no oracle) | Extra trees | LinearSVC | Logistic regression | Random forest | Ridge regression | SGD (MH) | SGD (SH) |
|---|---|---|---|---|---|---|---|---|---|
| | | 0.805 ± 0.167 (2) | 0.405 ± 0.164 (6)* | 0.672 ± 0.188 (3) | 0.626 ± 0.205 (5) | 0.378 ± 0.225 (8)* | 0.660 ± 0.178 (4) | 0.341 ± 0.146 (9)* | 0.386 ± 0.170 (7)* |
| | 0.678 ± 0.084 (5.5) | 0.674 ± 0.083 (7) | | 0.678 ± 0.060 (5.5) | 0.730 ± 0.067 (2) | 0.691 ± 0.071 (3) | 0.680 ± 0.077 (4) | 0.520 ± 0.312 (9) | 0.526 ± 0.296 (8) |
| 50 Features | | 0.909 ± 0.112 (2) | 0.713 ± 0.169 (7) | 0.812 ± 0.142 (4) | 0.800 ± 0.147 (5) | 0.714 ± 0.172 (6)* | 0.816 ± 0.150 (3) | 0.565 ± 0.135 (9)* | 0.569 ± 0.174 (8)* |
| | 0.736 ± 0.052 (7) | 0.746 ± 0.052 (4) | 0.742 ± 0.054 (5) | 0.773 ± 0.040 (2) | | 0.652 ± 0.056 (9)* | 0.751 ± 0.036 (3) | 0.736 ± 0.061 (6) | 0.714 ± 0.072 (8) |
| 75 Features | | 0.959 ± 0.099 (2) | 0.918 ± 0.141 (5) | 0.923 ± 0.124 (4) | 0.874 ± 0.166 (7) | 0.883 ± 0.160 (6) | 0.928 ± 0.116 (3) | 0.663 ± 0.167 (9)* | 0.674 ± 0.143 (8)* |
| | 0.786 ± 0.045 (5) | 0.786 ± 0.051 (4) | 0.703 ± 0.035 (7)* | 0.806 ± 0.040 (2) | | 0.593 ± 0.035 (9)* | 0.790 ± 0.043 (3) | 0.706 ± 0.068 (6)* | 0.685 ± 0.081 (8)* |
| Average ranking (Score) | 1 | 2 | 6 | 3.67 | 5.67 | 6.67 | 3.33 | 9 | 7.67 |
| Average ranking (MCC) | 5.83 | 5 | 4.33 | 3.17 | 1.33 | 7 | 3.33 | 7 | 8 |
The mean and sample standard deviation are reported with the best performing model being indicated in bold text. Asterisks indicate a statistically significant difference (p ≤ 0.05). A single asterisk denotes a statistically significant difference in favor of LANDMark (Oracle) while a double asterisk indicates a statistically significant difference in favor of the alternative model
Overview of the generalization performance (balanced accuracy) of each model when trained using toy data
| | LANDMark (oracle) | LANDMark (no oracle) | Extra trees | LinearSVC | Logistic regression | Random forest | Ridge regression | SGD (MH) | SGD (SH) |
|---|---|---|---|---|---|---|---|---|---|
| Two spirals | 0.935 ± 0.058 (3) | 0.939 ± 0.047 (2) | 0.445 ± 0.080 (9) | 0.491 ± 0.077 (6) | 0.866 ± 0.062 (4) | 0.495 ± 0.064 (5) | 0.487 ± 0.104 (7) | 0.457 ± 0.106 (8) | |
| Concentric circles | 0.844 ± 0.046 (1) | 0.837 ± 0.042 (2) | 0.781 ± 0.049 (3) | 0.364 ± 0.116 (7) | 0.428 ± 0.058 (5) | 0.802 ± 0.064 (4) | 0.425 ± 0.060 (6) | 0.305 ± 0.079 (9) | 0.348 ± 0.082 (8) |
| Parkinson’s disease | 0.718 ± 0.082 (2) | 0.677 ± 0.076 (5) | 0.679 ± 0.086 (4) | 0.693 ± 0.081 (3) | 0.667 ± 0.085 (6) | 0.635 ± 0.080 (7) | 0.554 ± 0.099 (9) | 0.564 ± 0.099 (8) | |
| Iris | 0.926 ± 0.038 (3) | 0.918 ± 0.037 (4) | 0.858 ± 0.076 (7) | 0.890 ± 0.083 (5) | 0.926 ± 0.035 (2) | 0.746 ± 0.041 (9) | 0.874 ± 0.068 (6) | 0.796 ± 0.129 (8) | |
| Breast cancer | 0.940 ± 0.017 (3) | 0.929 ± 0.023 (7) | 0.942 ± 0.014 (2) | 0.938 ± 0.021 (4) | 0.919 ± 0.024 (8) | 0.900 ± 0.024 (9) | 0.929 ± 0.021 (6) | 0.931 ± 0.027 (5) | |
| Wine | 0.974 ± 0.032 (3) | 0.973 ± 0.031 (5) | 0.976 ± 0.032 (2) | 0.967 ± 0.032 (7) | 0.974 ± 0.030 (4) | 0.971 ± 0.031 (6) | 0.965 ± 0.039 (8) | 0.946 ± 0.046 (9) | |
| Seeds | 0.902 ± 0.022 (3) | 0.905 ± 0.021 (2) | 0.877 ± 0.039 (7) | 0.890 ± 0.031 (4) | 0.887 ± 0.022 (5) | 0.849 ± 0.053 (9) | 0.874 ± 0.043 (6) | 0.851 ± 0.074 (8) | |
| Heart failure | 0.621 ± 0.064 (2) | 0.557 ± 0.069 (6) | 0.526 ± 0.111 (8) | 0.588 ± 0.085 (4) | 0.613 ± 0.061 (3) | 0.574 ± 0.068 (5) | 0.390 ± 0.131 (9) | 0.546 ± 0.117 (7) | |
| Cancer coimbra | 0.700 ± 0.107 (2) | 0.690 ± 0.099 (3) | 0.644 ± 0.124 (8) | 0.580 ± 0.127 (4) | 0.656 ± 0.114 (6) | 0.669 ± 0.135 (5) | 0.655 ± 0.121 (7) | 0.623 ± 0.123 (9) | |
| Raisin | 0.839 ± 0.026 (3) | 0.839 ± 0.027 (2) | 0.815 ± 0.031 (8) | 0.827 ± 0.049 (5) | 0.825 ± 0.032 (6) | 0.832 ± 0.034 (4) | 0.819 ± 0.104 (7) | 0.813 ± 0.035 (9) | |
| Two moons | 0.968 ± 0.057 (2) | 0.968 ± 0.052 (3) | 0.761 ± 0.169 (9) | 0.803 ± 0.086 (6) | 0.934 ± 0.081 (4) | 0.817 ± 0.075 (5) | 0.774 ± 0.090 (8) | 0.778 ± 0.146 (7) | |
| HMP | 0.953 ± 0.022 (3) | 0.948 ± 0.019 (7) | 0.936 ± 0.025 (8) | 0.950 ± 0.018 (4) | 0.923 ± 0.033 (9) | 0.949 ± 0.027 (5) | 0.954 ± 0.022 (2) | 0.949 ± 0.026 (6) | |
| Average Ranking | 2.17 | 2.75 | 4.5 | 5.92 * | 4.25 | 5.58 | 5.17 | 7 ** | 7.67 ** |
The mean, sample standard deviation, and rank for each model is reported. The best performing models are highlighted in bold text. Asterisks indicate a statistically significant (p ≤ 0.05) difference between models. A single asterisk indicates a difference in favor of LANDMark (Oracle) while a double asterisk indicates a difference in favor of either LANDMark model
Fig. 3 More accurate decision boundaries are recovered using LANDMark models. Decision boundaries discovered by various classifiers on the two-spirals dataset. The input data (a) was used to train (b) a single Extremely Randomized Tree, (c) a single decision tree, (d, e) two different LANDMark (Oracle) trees that demonstrate the randomness of the algorithm, (f) a single LANDMark (No Oracle) tree, (g) an Extremely Randomized Trees classifier consisting of 100 trees, (h) a Random Forest classifier consisting of 100 trees, (i) a LANDMark (Oracle) classifier consisting of 64 trees, and (j) a LANDMark (No Oracle) classifier consisting of 64 trees. Solid circles indicate data points used for training while crosses represent validation data. The balanced accuracy of each classifier is reported as the score. The shading in each plot is a qualitative representation of how confidently each model predicts the class of a particular sample. In panels (b)–(f) the red and blue regions are unshaded and show where each model predicts either the red or blue spiral, while in panels (g)–(j) the predictions of the ensemble members are averaged and darker regions represent areas where the prediction is more confident
Fig. 4 Principal Coordinate Analysis projections of test data can be used to assess model fit. Proximity matrices were extracted from the Extremely Randomized Trees, Random Forest, LANDMark (Oracle), and LANDMark (No Oracle) models trained on the Wisconsin Breast Cancer dataset. A higher amount of explained variance along the first principal coordinate, relative to other models, reflects the ability of a model to identify a set of simple decision pathways capable of classifying samples. Higher explained variance along additional components (relative to the other models) suggests the presence of complex decision pathways and overfitting
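The proximity-matrix analysis in this figure can be approximated as follows. The sketch assumes the usual Random Forest definition of proximity (the fraction of trees that send two samples to the same leaf) and embeds 1 − proximity with classical multidimensional scaling, which is equivalent to PCoA; the paper's exact procedure may differ.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Proximity: fraction of trees in which two samples share a leaf.
leaves = forest.apply(X)  # (n_samples, n_trees) leaf index per tree
prox = np.array([(leaves == row).mean(axis=1) for row in leaves])
dist = 1.0 - prox

# Classical MDS / PCoA: double-center the squared distances, eigendecompose.
n = dist.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (dist ** 2) @ J
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
coords = vecs[:, :2] * np.sqrt(np.clip(vals[:2], 0.0, None))
explained = vals[:2] / vals[vals > 0].sum()  # variance share of first two axes
```

Comparing `explained` across ensembles is the kind of diagnostic the caption describes: a large first-axis share suggests a few simple decision pathways dominate.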
Overview of the generalization performance (balanced accuracy) of each model when trained using metabarcoding data
| Amplicon | LANDMark (Oracle) | LANDMark (No Oracle) | Extra Trees | Linear SVC | Logistic Regression | Random Forest | Ridge Regression | SGD (MH) | SGD (SH) |
|---|---|---|---|---|---|---|---|---|---|
| F230 | 0.957 ± 0.061 (2)a | 0.940 ± 0.061 (6) | 0.944 ± 0.071 (5) | 0.931 ± 0.073 (7) | 0.951 ± 0.066 (3) | 0.944 ± 0.069 (4) | 0.922 ± 0.077 (8) | 0.919 ± 0.079 (9) | |
| BE | 0.963 ± 0.049 (7.5) | 0.967 ± 0.048 (5) | 0.976 ± 0.043 (2) | 0.970 ± 0.054 (4)a | 0.970 ± 0.047 (3) | 0.963 ± 0.049 (7.5) | 0.963 ± 0.055 (6) | 0.934 ± 0.062 (9) |
The average generalization performance and standard deviation (measured using balanced accuracy) for each classification model was calculated using test data after training each classification model on data derived from either the F230 or the BE amplicons
The best performing results are in bold
a Truncating to three significant digits resulted in the score rounding up to the nearest thousandth
Comparison of classifier generalization performance using the BE dataset before and after recursive feature elimination
| LANDMark (oracle) | LANDMark (No oracle) | Extra trees | LinearSVC | Logistic regression | Random forest | Ridge regression | SGD (MH) | SGD (SH) | |
|---|---|---|---|---|---|---|---|---|---|
| LANDMark (oracle) | 0.971 ± 0.002 0.963 ± 0.004 | 0.001 ± 0.001 | − 0.005 ± 0.004 | 0.002 ± 0.004 | − 0.013 ± 0.004 | − 0.008 ± 0.004 | − 0.006 ± 0.002 | − 0.026 ± 0.006* | − 0.018 ± 0.006 |
| LANDMark (no oracle) | 0.003 ± 0.002 | 0.971 ± 0.004 0.960 ± 0.004 | − 0.006 ± 0.004 | − 0.003 ± 0.004 | − 0.013 ± 0.004 | − 0.009 ± 0.004 | − 0.007 ± 0.002 | − 0.027 ± 0.006* | − 0.023 ± 0.005* |
| Extra trees | 0.009 ± 0.004 | 0.006 ± 0.004 | 0.966 ± 0.004 0.954 ± 0.005 | 0.003 ± 0.004 | − 0.008 ± 0.004 | − 0.004 ± 0.003 | − 0.002 ± 0.004 | − 0.022 ± 0.005* | − 0.023 ± 0.005* |
| LinearSVC | 0.016 ± 0.003** | − 0.013 ± 0.004** | 0.007 ± 0.005 | 0.968 ± 0.003 0.947 ± 0.004 | − 0.012 ± 0.004 | 0.006 ± 0.005 | − 0.004 ± 0.004 | − 0.024 ± 0.006* | − 0.021 ± 0.005* |
| Logistic regression | 0.017 ± 0.005** | 0.014 ± 0.004** | 0.008 ± 0.005 | 0.001 ± 0.005 | 0.958 ± 0.004 0.946 ± 0.005 | 0.004 ± 0.004 | 0.006 ± 0.003 | − 0.014 ± 0.006 | − 0.010 ± 0.006 |
| Random forest | 0.003 ± 0.004 | 0.000 ± 0.003 | − 0.006 ± 0.003 | − 0.013 ± 0.004 | − 0.015 ± 0.004 | 0.962 ± 0.004 0.960 ± 0.004 | 0.002 ± 0.004 | − 0.018 ± 0.006 | − 0.014 ± 0.006 |
| Ridge regression | − 0.003 ± 0.004 | − 0.006 ± 0.004 | − 0.012 ± 0.005 | − 0.019 ± 0.004* | − 0.020 ± 0.006* | − 0.006 ± 0.005 | 0.964 ± 0.003 0.966 ± 0.003 | − 0.020 ± 0.006 | − 0.016 ± 0.006 |
| SGD (MH) | 0.030 ± 0.005** | 0.027 ± 0.005 | 0.021 ± 0.006** | 0.014 ± 0.006 | 0.012 ± 0.006 | 0.027 ± 0.006** | 0.033 ± 0.006** | 0.944 ± 0.005 0.933 ± 0.005 | 0.004 ± 0.007 |
| SGD (SH) | 0.062 ± 0.013** | 0.059 ± 0.013 | 0.053 ± 0.014** | 0.046 ± 0.014 | 0.044 ± 0.013 | 0.059 ± 0.013** | 0.064 ± 0.012** | 0.032 ± 0.006** | 0.948 ± 0.005 0.901 ± 0.012 |
Models were trained using data from the BE amplicon and generalization performance was measured using the balanced accuracy score. The mean performance of each model before and after recursive feature elimination can be found along the main diagonal. The upper triangle reflects the difference of means between each comparison before recursive feature elimination while the bottom triangle reflects differences in means after recursive feature elimination. A single asterisk is used to represent a statistically significant difference (p ≤ 0.05) in generalization performance which favors classifiers along the rows while statistically significant differences favoring the classifiers along the columns are represented using a double asterisk. The mean and standard error are reported
Comparison of classifier generalization performance using the F230 dataset before and after recursive feature elimination
| LANDMark (oracle) | LANDMark (no oracle) | Extra trees | LinearSVC | Logistic regression | Random forest | Ridge regression | SGD (MH) | SGD (SH) | |
|---|---|---|---|---|---|---|---|---|---|
| LANDMark (oracle) | 0.932 ± 0.005 0.933 ± 0.006 | 0.001 ± 0.003 | − 0.008 ± 0.005 | − 0.016 ± 0.007 | − 0.005 ± 0.005 | − 0.009 ± 0.006 | − 0.008 ± 0.004 | − 0.017 ± 0.007 | − 0.037 ± 0.007* |
| LANDMark (no oracle) | − 0.001 ± 0.004 | 0.933 ± 0.006 0.935 ± 0.005 | − 0.009 ± 0.005 | − 0.016 ± 0.007 | − 0.006 ± 0.004 | − 0.010 ± 0.006 | − 0.008 ± 0.004 | − 0.018 ± 0.007 | − 0.038 ± 0.007* |
| Extra trees | 0.013 ± 0.006 | 0.014 ± 0.005 | 0.924 ± 0.005 0.921 ± 0.005 | − 0.008 ± 0.007 | 0.002 ± 0.005 | − 0.002 ± 0.005 | 0.000 ± 0.005 | − 0.009 ± 0.007 | − 0.029 ± 0.008* |
| LinearSVC | 0.007 ± 0.006 | 0.008 ± 0.004 | − 0.006 ± 0.007 | 0.916 ± 0.007 0.926 ± 0.005 | 0.010 ± 0.006 | 0.006 ± 0.009 | 0.008 ± 0.006 | − 0.001 ± 0.009 | − 0.021 ± 0.008 |
| Logistic regression | 0.012 ± 0.005 | 0.014 ± 0.004 | 0.000 ± 0.005 | 0.005 ± 0.006 | 0.927 ± 0.005 0.921 ± 0.004 | − 0.004 ± 0.005 | − 0.002 ± 0.004 | − 0.012 ± 0.006 | − 0.032 ± 0.007* |
| Random forest | 0.011 ± 0.004 | 0.013 ± 0.005 | − 0.001 ± 0.004 | 0.004 ± 0.007 | − 0.001 ± 0.004 | 0.922 ± 0.006 0.922 ± 0.006 | 0.002 ± 0.005 | − 0.007 ± 0.007 | − 0.027 ± 0.010 |
| Ridge regression | 0.006 ± 0.004 | 0.008 ± 0.004 | − 0.006 ± 0.005 | − 0.001 ± 0.007 | − 0.006 ± 0.005 | 0.005 ± 0.005 | 0.924 ± 0.005 0.927 ± 0.006 | − 0.009 ± 0.006 | − 0.029 ± 0.007* |
| SGD (MH) | 0.037 ± 0.009** | 0.038 ± 0.008 | 0.024 ± 0.010 | 0.030 ± 0.008** | 0.025 ± 0.008 | 0.031 ± 0.008** | 0.009 ± 0.006 | 0.915 ± 0.008 0.896 ± 0.008 | − 0.020 ± 0.010 |
| SGD (SH) | 0.063 ± 0.012** | 0.064 ± 0.011** | 0.050 ± 0.010** | 0.055 ± 0.012** | 0.050 ± 0.011** | 0.056 ± 0.010** | 0.029 ± 0.007** | 0.020 ± 0.010 | 0.895 ± 0.008 0.871 ± 0.010 |
Models were trained using data from the F230 amplicon and generalization performance was measured using the balanced accuracy score. The mean performance of each model before and after recursive feature elimination can be found along the main diagonal. The upper triangle reflects the difference of means between each comparison before recursive feature elimination while the bottom triangle reflects differences in means after recursive feature elimination. A single asterisk is used to represent a statistically significant difference (p ≤ 0.05) in generalization performance which favors classifiers along the rows while statistically significant differences favoring the classifiers along the columns are represented using a double asterisk. The mean and standard error are reported
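The recursive feature elimination referenced in the two tables above can be sketched with scikit-learn's `RFE`; the estimator, step size, and number of retained features below are illustrative assumptions rather than the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Toy stand-in for an ASV abundance table.
X, y = make_classification(n_samples=120, n_features=50, n_informative=5,
                           random_state=0)

# Repeatedly fit the model, rank features by importance, and drop the
# weakest 10% of the remaining features until only 10 are left.
selector = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=10,
    step=0.1,
)
selector.fit(X, y)
kept = selector.support_  # boolean mask over the 50 input features
```

Retraining each classifier on only the `kept` features and rescoring held-out data yields the kind of before/after comparison the tables report.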
Fig. 5 The sets of predictive ASVs selected by different classification models substantially overlap. Venn diagrams illustrating the overlap in the sets of ASVs (left), genera (middle), and families (right) from the Wood Buffalo F230 dataset corresponding to the fold containing the best LANDMark (Oracle) model. The top row compares LANDMark (Oracle) with the Linear Support Vector Machine classifier, the middle row compares LANDMark (Oracle) with the Random Forest classifier, and the bottom row compares LANDMark (Oracle) to an indicator species analysis
Fig. 6 The top 20 ASVs, selected using recursive feature elimination and LANDMark, from the F230 subset. These ASVs are used by LANDMark to help discriminate between the Athabasca and Peace River Deltas. The KernelSHAP method was used to calculate the SHAP values for each ASV in each sample. Each point is a sample, and its color reflects the presence (pink) or absence (blue) of the ASV listed along the y-axis. The larger the absolute value of a sample's score for a particular ASV along the x-axis, the more strongly that ASV shifts the prediction for that sample. Positive SHAP values push the prediction towards the Athabasca River Delta; negative SHAP values push it towards the Peace River Delta
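This figure relies on KernelSHAP from the `shap` package. To make the plotted quantity concrete, the sketch below computes exact Shapley values for a single sample by brute force: features outside a coalition are replaced by the background mean, and each feature's weighted marginal contributions are summed. The toy presence/absence data and Random Forest stand in for the ASV table and LANDMark; KernelSHAP approximates the same quantity far more efficiently.

```python
import itertools
import math
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = (rng.random((80, 4)) > 0.5).astype(float)  # presence/absence features
y = (X[:, 0] + X[:, 1] > 1).astype(int)        # class driven by features 0 and 1
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

background = X.mean(axis=0)

def value(sample, subset):
    """Model output when only features in `subset` take the sample's
    values; all others are filled with the background mean."""
    z = background.copy()
    z[list(subset)] = sample[list(subset)]
    return model.predict_proba(z.reshape(1, -1))[0, 1]

def shapley(sample):
    """Exact Shapley values via enumeration of all feature coalitions."""
    d = len(sample)
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(d):
            for s in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
                phi[i] += w * (value(sample, s + (i,)) - value(sample, s))
    return phi

phi = shapley(X[0])
```

The efficiency property of Shapley values guarantees that `phi` sums to the difference between the model's output on the full sample and on the pure background, which is what lets the per-ASV points in the figure be read as additive contributions.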
Fig. 7 Pairwise comparisons illustrating the relationship between average error and inter-rater agreement within decision tree ensembles. Kappa-error diagrams visualize the relationship between pairwise error (y-axis) and inter-rater reliability (x-axis) for each pair of base estimators in LANDMark, Extremely Randomized Trees, and Random Forests. Most LANDMark trees occupy the lower-right area of each plot, indicating pairs of high-accuracy trees that tend to agree on classifications
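The kappa-error points in this figure can be generated as follows; the dataset, ensemble size, and train/test split are assumptions, but the axes match the caption: Cohen's kappa between a pair of trees' predictions against the pair's mean error on held-out data.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
preds = np.array([tree.predict(X_te) for tree in forest.estimators_])

# One (kappa, mean error) point per pair of trees in the ensemble.
points = []
for a, b in combinations(range(len(preds)), 2):
    kappa = cohen_kappa_score(preds[a], preds[b])
    mean_error = 1.0 - 0.5 * ((preds[a] == y_te).mean() + (preds[b] == y_te).mean())
    points.append((kappa, mean_error))
points = np.array(points)
```

Scattering the first column against the second reproduces the diagram: pairs near the lower right are accurate trees that agree, the region the caption says LANDMark trees tend to occupy.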