William R Swindell, James M Harper, Richard A Miller.
Abstract
Prediction of individual life span based on characteristics evaluated at middle-age represents a challenging objective for aging research. In this study, we used machine learning algorithms to construct models that predict life span in a stock of genetically heterogeneous mice. Life-span prediction accuracy of 22 algorithms was evaluated using a cross-validation approach, in which models were trained and tested with distinct subsets of data. Using a combination of body weight and T-cell subset measures evaluated before 2 years of age, we show that the life-span quartile to which an individual mouse belongs can be predicted with an accuracy of 35.3% (±0.10%). This result provides a new benchmark for the development of life-span-predictive models, but improvement can be expected through identification of new predictor variables and development of computational approaches. Future work in this direction can provide tools for aging research and will shed light on associations between phenotypic traits and longevity.
Year: 2008 PMID: 18840793 PMCID: PMC2693389 DOI: 10.1093/gerona/63.9.895
Source DB: PubMed Journal: J Gerontol A Biol Sci Med Sci ISSN: 1079-5006 Impact factor: 6.053
Table 1. Predictor Variables.
| Variable ID | Age (Months) | Description |
|---|---|---|
| T-cell subsets | | |
| CD3_8 | 8 | Total T-cell marker, as a proportion of peripheral blood lymphocytes |
| CD3_18 | 18 | As above |
| CD4_8 | 8 | CD3+, CD4+, helper T cells, as a proportion of CD3 cells |
| CD4_18 | 18 | As above |
| CD4M_8 | 8 | CD4+, CD44high memory CD4 cells, as a proportion of CD4 cells |
| CD4M_18 | 18 | As above |
| CD4P_8 | 8 | CD4+ cells expressing P-glycoprotein, as a proportion of CD4 cells |
| CD4P_18 | 18 | As above |
| CD4V_8 | 8 | CD4+, CD45RBlow naive CD4 cells, as a proportion of CD4 cells |
| CD4V_18 | 18 | As above |
| CD8_8 | 8 | CD3+, CD8+, killer T cells, as a proportion of CD3 cells |
| CD8_18 | 18 | As above |
| CD8M_8 | 8 | CD8+, CD44high memory CD8 cells, as a proportion of CD8 cells |
| CD8M_18 | 18 | As above |
| CD8P_8 | 8 | CD8+ cells expressing P-glycoprotein, as a proportion of CD8 cells |
| CD8P_18 | 18 | As above |
| Hormones | | |
| T4_4 | 4 | Serum thyroxine (μg/dL) |
| T4_15 | 15 | As above |
| LEP_4 | 4 | Serum leptin (ng/mL) |
| LEP_15 | 15 | As above |
| IGF_4 | 4 | Serum IGF-I (ng/mL) |
| IGF_15 | 15 | As above |
| Body weight | | |
| W8 | 8 | Body weight |
| W10 | 10 | As above |
| W12 | 12 | As above |
| W18 | 18 | As above |
| Other | | |
| LitSize | N/A | No. of pups in litter |
| Cat18 | 18 | Cataract score, corrected for secular trend |
| Cat24 | 24 | As above |
| Gender1 | N/A | Indicator variable defined as 1 if sex = male, 0 otherwise |
| Gender2 | N/A | Indicator variable defined as 1 if sex = female, 0 otherwise |
Note: Table lists the age at which data were obtained and provides a brief description of each variable. See (5–9) and (11) for further description of variables.
Figure 1. Simple and complex relationships between predictor variables and life span. Each plot displays data for 12 individuals in relation to scores on two predictors. The four symbol types represent individuals associated with each life-span quartile. A, Simple relationship between predictors and life span. Individuals associated with different life-span quartiles are distinguished by simple linear decision boundaries (dotted lines). Simple learning algorithms may perform well in comparison to complex algorithms. B, Complex relationship between predictors and life span. Individuals associated with different life-span quartiles can only be distinguished through recognition of irregular decision regions. Complex learning algorithms are required for accurate prediction of life span based on the two predictors.
Table 2. Algorithm Performance Summary.
| Method | 31 Predictors (%) | 12 Predictors (%) | 6 Predictors (%) |
|---|---|---|---|
| Nearest Shrunken Centroid* | 32.86 | 33.61 | 34.37 |
| Stabilized Linear Discriminant Analysis† | 30.50 | 34.06 | 32.30 |
| Support Vector Machine‡ | 29.90 | 34.03 | 33.57 |
| Gaussian Process§ | 30.29 | 32.81 | 33.90 |
| Conditional Inference Tree Forest‖ | 29.88 | 33.49 | 32.65 |
| Random Forest¶ | 30.18 | 33.34 | 29.36 |
| Support Vector Machine** | 29.56 | 33.27 | 31.54 |
| Nearest Centroid†† | 30.86 | 33.16 | 32.43 |
| Localized Linear Discriminant Analysis‡‡ | 27.29 | 32.44 | 32.06 |
| Naive Bayes§§ | 29.14 | 32.19 | 31.16 |
| Projection Pursuit Linear Discriminant Analysis Tree‖‖ | 28.45 | 31.93 | 31.28 |
| Linear Discriminant Analysis¶¶ | 27.03 | 31.86 | 31.16 |
| Multinomial Logistic Regression*** | 27.02 | 31.78 | 31.14 |
| Stump Decision Trees††† | 30.46 | 30.50 | 31.15 |
| Artificial Neural Network‡‡‡ | 27.94 | 30.50 | 30.90 |
| Binary Decision Trees§§§ | 30.06 | 30.18 | 30.87 |
| Conditional Inference Tree‖‖‖ | 30.03 | 29.60 | 30.55 |
| K-Nearest Neighbor¶¶¶ | 26.75 | 29.11 | 28.31 |
| C4.5 Decision Tree**** | 27.43 | 28.03 | 29.07 |
| PART Decision Tree†††† | 27.16 | 28.99 | 28.60 |
| Simple Linear Regression‡‡‡‡ | 27.04 | 28.79 | 27.04 |
| Ripper Rule Learner§§§§ | 25.82 | 26.10 | 25.43 |
| Random Guessing‖‖‖‖ | 24.99 | 25.02 | 25.01 |
Notes: Life-span quartile prediction accuracy was evaluated for 22 machine learning algorithms using all 31 predictor variables, the 12 most important predictors, and the 6 most important predictors. Variable importance was determined based on the Random Forest algorithm (see Figure 3). Algorithms are ranked according to their best overall performance among the three predictor variable subsets. For each listed value, accuracy is based on 10,000 simulations in which 664 mice (90%) were randomly selected as training data and used in model construction, with 77 mice (10%) used as testing data for model evaluation (see Methods). For each simulation, the proportion of correct life-span quartile predictions among the 77 testing set mice was determined. The table lists the average percent accuracy obtained among all 10,000 simulations (95% confidence intervals are approximately ± 0.10%). The R package and function used to implement each algorithm are given in brackets [package, function].
*Algorithm of Tibshirani and colleagues (29). Similar to Nearest Centroid approach, except class centroids are standardized and “shrunk” toward an overall centroid before evaluation of test cases. [pamr, pamr.train]
†Left-spherically distributed linear scores are derived from predictor variables following the dimensionality reduction rule of Laeuter and colleagues (30). Linear scores are used as inputs for standard linear discriminant analysis. [ipred, slda]
‡Predictor variables are mapped to a higher dimensional space using a specified kernel function. A linear hyperplane is identified with the largest possible margin between the two classes to be distinguished, and this “maximal margin” hyperplane is used to classify test cases. The listed accuracies were obtained using a polynomial kernel function, with model parameters chosen by searching possible values and identifying those that minimize prediction errors on the training data. An introduction to support vector machines is provided by Byvatov and Schneider (31). [e1071, svm]
§Training data are modeled as a Gaussian Process, with mean and covariance functions partly determined by parameters estimated during model training. Within this framework, class probabilities associated with each test case are estimated, as described by Williams and Barber (32). Density estimation was performed using the radial basis kernel. [kernlab, gausspr]
‖Similar to Random Forest, except conditional inference trees are used as a base classifier. Listed accuracy obtained using 100 decision trees per forest. [party, cforest]
¶Algorithm of Breiman (33). Predictor variables and training examples are randomly selected to construct a “forest” of decision trees. Test cases are then classified by a voting procedure among all trees in the forest. Listed accuracy was obtained by growing 1000 decision trees per simulation. [randomForest, randomForest]
**Implements support vector training method of Platt (34). [RWeka, SMO]
††Centroids are computed for each class using the training data. Test cases are then assigned the class label of the most similar centroid. [klaR, nm]
‡‡Implements the linear discriminant analysis approach proposed by Tutz and Binder (35). [klaR, loclda]
§§A probability density function is estimated for each class based on training data. For each test case, the estimated density is used to compute the probability of class membership for each class (assuming conditional independence of class-conditional probabilities). Each test case is assigned to the class for which its class-conditional probability is greatest. [klaR, NaiveBayes]
‖‖Decision tree in which projection pursuit linear discriminant functions are used as attributes at each node (36). [classPP, PP.tree]
¶¶Linear combinations of predictor variables are constructed to maximize the ratio of variation between classes versus the variance within classes. This provides a subspace of predictors with lower dimensionality that is used for classification of test cases. [klaR, lda]
***Multinomial logistic regression models based on ridge regression. See le Cessie and van Houwelingen (37). [RWeka, Logistic]
†††Binary decision trees are constructed using only one predictor variable. [RWeka, DecisionStump]
‡‡‡A neural network with one hidden layer and four output nodes (one for each class); the number of input nodes equals the number of predictor variables. During the training stage, a set of network weights that minimizes training errors is iteratively identified, and determines the contribution of each input variable to the overall network response. Basheer and Hajmeer (38) provide an introduction to this approach. [nnet, nnet]
§§§Binary decision trees grown by recursive partitioning. Predictor variables were split to maximize information gain. [tree, tree]
‖‖‖Two-step algorithm of Hothorn and colleagues (39). The variable most strongly associated with class labels among training examples is selected, and a decision tree branch is formed through a binary split of this chosen variable. The process repeats until all predictors significantly associated with class labels (at level α) have been incorporated. Listed accuracy obtained using α = 0.30. [party, ctree]
¶¶¶For each test case, the k most similar observations among the training data are identified. The test case is assigned the most frequently occurring class label among the k most similar training observations. Listed accuracy obtained for k = 5. (see 40). [klaR, knn]
****Decision tree induction following Hunt's algorithm, in which trees are recursively grown by splitting attributes to maximize information gain (41). [RWeka, J48]
††††Partial decision trees (42). [RWeka, PART]
‡‡‡‡Least-squares multiple regression. Life-span quartile is treated as a numeric ordinal variable, and model selection is performed using the Akaike Information Criterion. [RWeka, LinearRegression]
§§§§Test cases are classified according to a series of “if…then” rules extracted from training data using the RIPPER algorithm (43). [RWeka, JRip]
‖‖‖‖No model building is performed, and class labels are randomly assigned to test cases.
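The evaluation protocol described in the table notes (repeated random splits into training and testing mice, with accuracy averaged over simulations and compared against random guessing) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R code: the data are synthetic, only 200 trials are run instead of 10,000, and a plain nearest-centroid classifier stands in for the algorithms in the table. The function names `nearest_centroid_predict` and `repeated_holdout_accuracy` are hypothetical.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row the label of the closest class centroid (Euclidean)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Squared distance from every test row to every centroid: (n_test, n_classes)
    d2 = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

def repeated_holdout_accuracy(X, y, n_test=77, n_trials=200, seed=0):
    """Mean test accuracy of the classifier and of random guessing over random splits."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    model_acc, guess_acc = [], []
    for _ in range(n_trials):
        perm = rng.permutation(len(y))
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        model_acc.append(np.mean(pred == y[test_idx]))
        # Baseline: labels drawn uniformly at random, no model building
        guess = rng.choice(classes, size=n_test)
        guess_acc.append(np.mean(guess == y[test_idx]))
    return float(np.mean(model_acc)), float(np.mean(guess_acc))

# Synthetic stand-in data (NOT the mouse data): 741 rows, 5 features, 4 quartile labels
rng = np.random.default_rng(1)
y = rng.integers(0, 4, size=741)
X = rng.normal(size=(741, 5)) + 0.3 * y[:, None]  # weak class signal, chosen arbitrarily
model_mean, guess_mean = repeated_holdout_accuracy(X, y)
print(f"nearest centroid: {model_mean:.3f}, random guessing: {guess_mean:.3f}")
```

With four equally likely labels, the random-guessing baseline should land near 25%, which is the reference point against which every method in the table is read.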
Figure 2. Overlap between short- and long-lived life-span quartiles. In each plot, open circles (○) represent mice associated with the shortest-lived quartile, whereas plus signs (+) represent mice associated with the longest-lived quartile. A and B, Quartile relationships with respect to body weight and the CD8M T-cell subset at 8 and 18 months of age, respectively. C, All mice in the data set plotted with respect to two linear discriminant axes, which were derived from the 29 continuous predictor variables listed in Table 1. D, Mice plotted with respect to the first two principal components derived from the 29 continuous predictor variables. The two principal components account for 6.8% of the total variation among the 29 variables.
Figure 3. Random Forest evaluation of variable importance. Variables listed near the top of the figure are most important for prediction of life-span quartile. The Random Forest algorithm constructs a “forest” of decision trees, where each tree generates predictions based on a randomly chosen set of predictor variables. Predictions are then generated by majority vote among all trees in the forest (33). For the 741 mice we considered, the algorithm generates forests using a bootstrap sample of approximately 500 mice; remaining mice are used as an internal testing set for accuracy evaluation. For a given predictor X, predictive accuracy is evaluated before and after permuting values of X among mice. The difference between the two obtained accuracies is then determined and used as a measure of variable importance. This procedure was repeated 10,000 times using forests of 2000 decision trees per trial. The figure shows the average accuracy change among all 10,000 trials.
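The permute-and-compare idea in the Figure 3 caption can be illustrated with a short Python sketch. Two simplifications are assumed here: a nearest-centroid classifier replaces the Random Forest, and a fixed held-out set replaces the forest's internal testing mice. All data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    """Return class labels and one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict(classes, centroids, X):
    """Label each row with the class of its nearest centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

def permutation_importance(X_train, y_train, X_test, y_test, n_repeats=50, seed=0):
    """Mean drop in test accuracy after shuffling each predictor in turn."""
    rng = np.random.default_rng(seed)
    classes, centroids = fit_centroids(X_train, y_train)
    baseline = np.mean(predict(classes, centroids, X_test) == y_test)
    importance = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            # Shuffle column j among test rows, breaking its link to the labels
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            permuted = np.mean(predict(classes, centroids, X_perm) == y_test)
            drops.append(baseline - permuted)
        importance[j] = np.mean(drops)
    return importance

# Synthetic data: predictor 0 carries the class signal, predictor 1 is pure noise
rng = np.random.default_rng(2)
y = rng.integers(0, 4, size=800)
X = np.column_stack([y + rng.normal(scale=0.5, size=800), rng.normal(size=800)])
imp = permutation_importance(X[:600], y[:600], X[600:], y[600:])
print("accuracy drop per predictor:", np.round(imp, 3))
```

A predictor whose shuffling barely changes accuracy contributes little to the model, which is the sense in which variables near the bottom of the figure are less important.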
Figure 4. Variable subset evaluation. Prediction accuracy was evaluated for different variable subsets using the Nearest Shrunken Centroid algorithm. Variables were entered into the model according to their importance, as indicated by results shown in Figure 3. The first model included variables CD4V_18 and W10 as life-span predictors. The accuracy obtained using this model was evaluated by 10-fold cross-validation and 10,000 simulations. The average accuracy among all simulations is represented by the leftmost bar. Each subsequent bar from left to right indicates the accuracy obtained by adding the next most important variable to the model. The rightmost bar represents the mean accuracy obtained using the full model with all 31 predictor variables.
Figure 5. Simulation results. The Nearest Shrunken Centroid (NSC) algorithm was implemented using five predictor variables (CD4M_8, CD4V_18, W8, W10, and W18). The plot shows the distribution of accuracies obtained on the testing set among 10,000 simulation trials. Solid line: Accuracy distribution obtained by a random guessing algorithm. Dotted line: Accuracy distribution obtained by the NSC algorithm. Solid vertical line: Mean accuracy among all trials for the random guessing algorithm (approximately 25%). Dotted vertical line: Mean accuracy among all trials for the NSC algorithm (35.3%).
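As a rough check on the random-guessing curve, its spread can be worked out directly. The sketch below assumes that each of the 77 testing-set guesses is an independent draw with a 1-in-4 chance of being correct, and it treats the 10,000 simulations as independent, which is only an approximation because they resample the same mice.

```python
import math

n_test = 77      # testing-set size per simulation (from the table notes)
p_guess = 0.25   # chance of guessing the right quartile out of four
n_sims = 10_000  # simulation trials

# One simulation's accuracy is Binomial(n_test, p_guess) / n_test
sd_single = math.sqrt(p_guess * (1 - p_guess) / n_test)
# Standard error of the mean accuracy across simulations (independence assumed)
se_mean = sd_single / math.sqrt(n_sims)
ci_half_width = 1.96 * se_mean

print(f"SD of one simulation's accuracy: {100 * sd_single:.2f}%")
print(f"95% CI half-width of the mean:   {100 * ci_half_width:.3f}%")
```

Under these assumptions a single trial's accuracy has a standard deviation of roughly 4.9 percentage points, and the 95% half-width of the 10,000-trial mean is about 0.10 percentage points, in line with the margin quoted for the reported accuracies.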
Figure 6. Predicted life span and survivorship. The life span of each mouse in the data set was predicted using the Nearest Shrunken Centroid and the leave-one-out method (see text). Five predictors were used in the model (CD4M_8, CD4V_18, W8, W10, and W18). Dotted line: Cohort of mice with predicted life span in the upper two quartiles. Solid line: Cohort of mice with predicted life span in the lower two quartiles. Dotted vertical line: Minimum life span of mice that were included in the analysis (720 days). Zero survivorship is indicated by the dotted horizontal line.
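The leave-one-out procedure behind this figure can be sketched in Python: each mouse is predicted by a model fitted to all other mice, and survivorship is then compared between the cohorts predicted to fall in the upper and lower two quartiles. This is an illustrative sketch only. It uses a plain nearest-centroid classifier without the shrinkage step of the Nearest Shrunken Centroid method, the life spans and predictors are synthetic, and the function names are hypothetical.

```python
import numpy as np

def loo_predict(X, y):
    """Leave-one-out quartile predictions from a nearest-centroid classifier."""
    n = len(y)
    classes = np.unique(y)
    preds = np.empty(n, dtype=y.dtype)
    for i in range(n):
        keep = np.arange(n) != i  # train on every individual except i
        X_tr, y_tr = X[keep], y[keep]
        centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
        preds[i] = classes[((centroids - X[i]) ** 2).sum(axis=1).argmin()]
    return preds

def survivorship(lifespans, ages):
    """Fraction of a cohort still alive at each age."""
    lifespans = np.asarray(lifespans)
    return np.array([(lifespans > a).mean() for a in ages])

# Synthetic cohort (NOT the mouse data): life span in days, weakly tied to predictor 0
rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 2))
lifespan = 900 + 60 * X[:, 0] + rng.normal(scale=100, size=n)
cuts = np.quantile(lifespan, [0.25, 0.50, 0.75])
quartile = np.searchsorted(cuts, lifespan)  # 0 = shortest-lived, 3 = longest-lived

pred = loo_predict(X, quartile)
ages = np.array([800, 900, 1000])
upper = survivorship(lifespan[pred >= 2], ages)  # predicted upper two quartiles
lower = survivorship(lifespan[pred <= 1], ages)  # predicted lower two quartiles
print("upper cohort:", np.round(upper, 2), "lower cohort:", np.round(lower, 2))
```

If the predictors carry any signal, the cohort predicted to be long-lived should show higher survivorship at each age, which is the separation between the dotted and solid curves that the figure is meant to display.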
Figure 7. Posterior probability and life span. The posterior probability of belonging to the longest-lived quartile was determined for each mouse in the data set using the leave-one-out method (see text). Five predictors were used in the model (CD4M_8, CD4V_18, W8, W10, and W18). Dotted line: Least-squares regression line.
Figure 8. Effect of training set size on accuracy. The number of training samples was varied between 10% and 90% of the total data set (between 75 and 667 mice). For each training set size, the mean accuracy obtained by the Nearest Shrunken Centroid algorithm was determined by 10,000 simulations (using 10-fold cross-validation). The mean accuracies obtained for each training set size are indicated by the solid line. Dashed lines: Forecasted accuracy for larger training set sizes (between 667 and 1000 mice). Middle dashed line: Forecasted accuracy. Top and bottom dashed lines: Standard error margins. Forecasts were generated using a moving average model with two parameters and two degrees of differencing. Model selection was based on the Akaike Information Criterion described by Brockwell and Davis (72).