
Identification of a 5-protein biomarker molecular signature for predicting Alzheimer's disease.

Martín Gómez Ravetti, Pablo Moscato.

Abstract

BACKGROUND: Alzheimer's disease (AD) is a progressive brain disease with a huge cost to human lives. The impact of the disease is also a growing concern for the governments of developing countries, in particular due to the increasingly high number of elderly citizens at risk. Alzheimer's is the most common form of dementia, a common term for memory loss and other cognitive impairments. There is no current cure for AD, but there are drug and non-drug based approaches for its treatment. In general the drug-treatments are directed at slowing the progression of symptoms. They have proved to be effective in a large group of patients but success is directly correlated with identifying the disease carriers at its early stages. This justifies the need for timely and accurate forms of diagnosis via molecular means. We report here a 5-protein biomarker molecular signature that achieves, on average, a 96% total accuracy in predicting clinical AD. The signature is composed of the abundances of IL-1alpha, IL-3, EGF, TNF-alpha and G-CSF.
METHODOLOGY/PRINCIPAL FINDINGS: Our results are based on a recent molecular dataset that has attracted worldwide attention. Our paper illustrates that improved results can be obtained with the abundance of only five proteins. Our methodology consisted of the application of an integrative data analysis method. This four step process included: a) abundance quantization, b) feature selection, c) literature analysis, d) selection of a classifier algorithm which is independent of the feature selection process. These steps were performed without using any sample of the test datasets. For the first two steps, we used the application of Fayyad and Irani's discretization algorithm for selection and quantization, which in turn creates an instance of the (alpha-beta)-k-Feature Set problem; a numerical solution of this problem led to the selection of only 10 proteins.
CONCLUSIONS/SIGNIFICANCE: the previous study has provided an extremely useful dataset for the identification of AD biomarkers. However, our subsequent analysis also revealed several important facts worth reporting: 1. A 5-protein signature (which is a subset of the 18-protein signature of Ray et al.) has the same overall performance (when using the same classifier). 2. Using more than 20 different classifiers available in the widely-used Weka software package, our 5-protein signature has, on average, a smaller prediction error indicating the independence of the classifier and the robustness of this set of biomarkers (i.e. 96% accuracy when predicting AD against non-demented control). 3. Using very simple classifiers, like Simple Logistic or Logistic Model Trees, we have achieved the following results on 92 samples: 100 percent success to predict Alzheimer's Disease and 92 percent to predict Non Demented Control on the AD dataset.

Year:  2008        PMID: 18769539      PMCID: PMC2518833          DOI: 10.1371/journal.pone.0003111

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Recently, Ray et al. [1] made a significant contribution to the quest of finding a superior molecular test for an earlier diagnosis of Alzheimer's disease (AD). The method appears to have significantly improved on the state-of-the-art and, as a consequence, their results attracted immediate worldwide attention. Using the abundance of 120 signalling proteins on a training set of 83 archived plasma samples, they produced an 18-protein signature. On two separate test sets of 92 (“AD”, Alzheimer's samples against control) and 47 (“MCI”, mild cognitive impairment samples) samples, the signature was able to show an overall effectiveness of 81% and 91% for AD predictability.
We started this project by analysing the dataset made available, and we are glad to report that we have been able to perfectly reproduce their mathematical methods and results from the available datasets. However, our subsequent analysis also produced several important facts worth reporting. First, using an integrative bioinformatics approach, we identified a 6-protein signature that halves the number of prediction errors of the previously proposed signature on the “AD” dataset when using the same classifier (PAM). Second, a 5-protein signature (which is a subset of the 18-protein signature of Ray et al.) has the same overall performance. Finally, using more than 20 different classifiers available in the widely-used Weka software package [2], our 5-protein signature has, on average, a smaller prediction error, indicating the independence of the classifier and the robustness of this set of biomarkers (i.e. 96% accuracy when predicting AD against non-demented control). The 6-protein signature is composed of the abundances of IL-1α, IL-3, IL-6, EGF, TNF-α and G-CSF. We remark that IL-6 was not selected by Ray et al. in the preliminary gene selection, and as a consequence it is not part of their 18-protein signature. Recognising that the importance of IL-6 as a biomarker for AD is debatable and that many classifiers do not make use of its abundance to inform decisions, we also present our results for a 5-protein signature that ignores IL-6.

Results

Base case–analysis of the performance of randomly selected signatures

Before reporting our experimental results, it was important to understand the worst possible performance that a set of k proteins can have when they are selected at random from the 120 proteins under study. We show the results of two experiments that aim at quantifying this. First, we show the classification performance of 20 signatures with 18 proteins selected at random with a uniform distribution (we selected 18 because it is the number of proteins in the signature proposed by Ray et al.). Analogously, we performed the same experiment constrained to select only six proteins at random (as we will later present comparative results using signatures that only employ 6 and 5 proteins). The two collections of 20 randomly generated signatures were chosen using an equal probability for each of the 120 proteins in the set, not allowing repetitions and constrained to have either 18 or 6 different proteins in total. For this experiment, we used a random forests algorithm (RF) as the base classifier (the implementation in [3], for reproducibility purposes), generating 150 trees. As the chosen classifier also has a stochastic nature, for each signature we ran 10 experiments with different seeds, and the results are quite interesting. For the twenty 18-protein signatures, the average number of errors over the 92 samples of the “AD” test set is 15.13, meaning an 84% effectiveness (see Table 1). For the 6-protein case, an average of 30.5 errors was observed, meaning that an expected lower value of 67% effectiveness was found (see Table 2). From these results we can infer that the original selection of the 120 proteins is quite remarkable for revealing biomarkers for prediction of clinical AD, since a simple, yet robust, classification method allows us to find a “good” 18-protein predictor with only a random selection procedure restricted to these 120 proteins.
Table 3, Figure 1 and Figure 2 summarise the experiment.
Table 1

Number of errors from the 18-genes randomly selected signatures on the “AD” validation test set.

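Seed Number | S18-1 | S18-2 | S18-3 | S18-4 | S18-5 | S18-6 | S18-7 | S18-8 | S18-9 | S18-10
76 | 18 | 14 | 11 | 18 | 29 | 18 | 11 | 10 | 4 | 10
144 | 18 | 15 | 12 | 19 | 25 | 17 | 13 | 11 | 7 | 13
121 | 18 | 15 | 10 | 22 | 25 | 19 | 11 | 8 | 7 | 13
83 | 17 | 14 | 11 | 21 | 27 | 18 | 13 | 12 | 6 | 15
33 | 20 | 18 | 12 | 20 | 27 | 16 | 11 | 11 | 6 | 15
51 | 15 | 16 | 11 | 21 | 26 | 17 | 12 | 8 | 6 | 15
162 | 15 | 13 | 13 | 20 | 24 | 21 | 14 | 8 | 7 | 13
37 | 13 | 14 | 11 | 21 | 29 | 20 | 10 | 9 | 7 | 11
136 | 17 | 16 | 13 | 22 | 23 | 20 | 10 | 10 | 5 | 14
60 | 18 | 10 | 11 | 17 | 22 | 18 | 10 | 9 | 7 | 15
Average Error | 16.9 | 14.5 | 11.5 | 20.1 | 25.7 | 18.4 | 11.5 | 9.6 | 6.2 | 13.4
Average Accuracy | 81.6% | 84.2% | 87.5% | 78.2% | 72.1% | 80.0% | 87.5% | 89.6% | 93.3% | 85.4%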

The Random forest algorithm was used as classifier. For each signature 10 runs with different seeds were done. We used the WEKA software implementation, and the algorithm was allowed to generate 150 trees. The best and worst signatures are highlighted in bold text. In two cases we found signatures that classify above 90%, comparable with the results of Ray et al. that report on 91% AD predictability as a result of their proposed methodology.

Table 2

Number of errors from the 6-genes randomly selected signatures on the “AD” validation test set.

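Seed Number | S6-1 | S6-2 | S6-3 | S6-4 | S6-5 | S6-6 | S6-7 | S6-8 | S6-9 | S6-10
76 | 40 | 34 | 20 | 31 | 31 | 32 | 29 | 32 | 24 | 34
144 | 40 | 32 | 19 | 34 | 32 | 33 | 30 | 31 | 23 | 33
121 | 38 | 37 | 18 | 33 | 35 | 30 | 28 | 32 | 27 | 31
83 | 40 | 33 | 19 | 31 | 33 | 34 | 27 | 27 | 24 | 31
33 | 41 | 33 | 17 | 35 | 33 | 30 | 27 | 28 | 27 | 29
51 | 39 | 33 | 19 | 28 | 34 | 30 | 28 | 28 | 24 | 30
162 | 41 | 35 | 19 | 31 | 36 | 34 | 28 | 27 | 26 | 33
37 | 40 | 33 | 17 | 32 | 31 | 29 | 27 | 35 | 24 | 32
136 | 42 | 36 | 19 | 34 | 34 | 32 | 30 | 34 | 24 | 26
60 | 40 | 35 | 17 | 28 | 27 | 31 | 29 | 32 | 23 | 29
Average Error | 40.1 | 34.1 | 18.4 | 31.7 | 32.6 | 31.5 | 28.3 | 30.6 | 24.6 | 30.8
Average Accuracy | 56.4% | 62.9% | 80.0% | 65.5% | 64.6% | 65.8% | 69.2% | 66.7% | 73.3% | 66.5%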

The Random forest algorithm was used as classifier. For each signature 10 runs with different seeds were done. We used the WEKA software implementation, and the algorithm was allowed to generate 150 trees. The best and worst signatures are highlighted in bold text. This result shows what is expected: a 6-protein signature whose biomarkers are randomly chosen performs significantly worse than the panel of 18 biomarkers selected by Ray et al. Now the best result (81.5%) is worse than the average result of a random 18-protein signature (86%).

Table 3

Random experiments report.

Measure | 18-gene random signatures | 6-gene random signatures
Average Error | 15.14 | 30.59
Best Signature (average) | 6.2 | 17
Worst Signature (average) | 25.7 | 40.5
Standard Deviation | 5.36 | 6.21
Accuracy Average | 83.5% | 66.7%

The table shows the average results of the 20 random signatures for each size, also including the best and worst results and the standard deviation. The accuracy average is calculated considering the error average over the 92 samples of “AD” validation test set.
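The conversion from an average number of errors to the accuracy average in Table 3 can be checked with a few lines of Python. This is only an illustrative sketch: the function name and constant are our own, and the inputs are the average errors reported in Table 3 together with the 92 samples of the “AD” validation test set.

```python
# Number of samples in the "AD" validation test set (stated in the text).
N_AD_TEST = 92

def accuracy_from_errors(avg_errors: float, n_samples: int = N_AD_TEST) -> float:
    """Return accuracy in percent given an average number of misclassified samples."""
    return 100.0 * (1.0 - avg_errors / n_samples)

# Average errors reported in Table 3 for the two collections of random signatures.
acc_18 = accuracy_from_errors(15.14)
acc_6 = accuracy_from_errors(30.59)

print(f"18-protein random signatures: {acc_18:.2f}%")
print(f"6-protein random signatures:  {acc_6:.2f}%")
```

Both values agree, after rounding, with the "Accuracy Average" row of Table 3.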

Figure 1

Histograms of the number of errors of the random forest classifier using 20 randomly selected signatures with 18 proteins.

The arrow indicates the results under the same conditions of the 18-protein signature proposed by Ray et al.

Figure 2

Histograms of the number of errors considering the random forest classifier and the 20 randomly selected signatures with 6 proteins.

The arrow indicates the results under the same conditions of our 6-protein signature.

It is remarkable that by choosing 18 proteins at random we were able to obtain a very good signature, at least for this classifier, under the conditions explained above. Perhaps the reason for obtaining such good signatures is that a smaller number of proteins, which these signatures have in common, is all that is needed for a predictive molecular signature. Figures 1 and 2 show how the 18- and 6-protein signatures under consideration compare with the random ones.
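The sampling protocol of this base-case experiment (20 signatures per size, drawn uniformly and without repetition from 120 proteins, each evaluated over 10 classifier seeds) can be sketched as follows. This is a minimal illustration rather than the code used in the study: proteins are represented by hypothetical integer indices, the seeds passed to the sampler are arbitrary, and the classifier itself is not included. The only real data used are the ten error counts of signature S18-9 from Table 1.

```python
import random
from statistics import mean

N_PROTEINS = 120    # proteins measured in the dataset
N_SIGNATURES = 20   # random signatures generated per signature size

def random_signatures(size, n_signatures=N_SIGNATURES, n_proteins=N_PROTEINS, seed=0):
    """Draw signatures uniformly at random, without repeated proteins inside a signature."""
    rng = random.Random(seed)  # the seed value is an arbitrary choice for this sketch
    return [sorted(rng.sample(range(n_proteins), size)) for _ in range(n_signatures)]

sigs_18 = random_signatures(18)
sigs_6 = random_signatures(6, seed=1)

# Errors of signature S18-9 over the ten classifier seeds (column S18-9 of Table 1).
s18_9_errors = [4, 7, 7, 6, 6, 6, 7, 7, 5, 7]
s18_9_average = mean(s18_9_errors)

print(len(sigs_18), "signatures of size", len(sigs_18[0]))
print(len(sigs_6), "signatures of size", len(sigs_6[0]))
print("S18-9 average error:", s18_9_average)
```

Averaging over the classifier seeds, as done in the last step, is what yields the "Average Error" rows of Tables 1 and 2.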

Computational studies: Results obtained with four different signatures

We report all the results obtained using a set of 24 classifiers selected from the Weka software suite [3], aiming to sample the different algorithmic methodologies in current practice. These classifiers are applied to the four different signatures with the same training set. To ensure reproducibility, no parameter was modified from each classifier's default setting in Weka's downloaded code. In this way we avoid biasing the experiment with ad hoc parameter selection and ensure the complete reproducibility of our claims. We are aware that better results are possible when adjusting the parameters of each classifier considering only the samples of the training set. Nevertheless, the objective of these tests is to show the robustness of our biomarker discovery methods by showing that the performance of a signature is independent of the selected classifier. It is interesting to note that the mathematical model and algorithms we used pointed at Interleukin-6 and included it in the 10-protein signature. It is well known that IL-6, together with other cytokines, has been the subject of many studies of biomarkers for Alzheimer's disease [4]–[6]. Using an integrative bioinformatics approach, described in the next sections, we then turned our attention to a smaller signature. The 6-protein signature was obtained by the analysis of the protein-relation graph and, interestingly enough, IL-6 is also included in this new core signature. Finally, in the 5-protein signature, IL-6 is excluded to provide another comparison, and the five proteins become a proper subset of the 18 original proteins uncovered by Ray et al. Table 4 presents the proteins included in each signature, indicating the protein name, Entrez GeneID and official gene name.
Table 4

Protein name for each signature used in the computational experiment.

Protein Name | Entrez GeneID | Official gene name provided by HUGO Gene Nomenclature Committee (HGNC) | In signature 18 | 10 | 6 | 5
ANG-2 | 285 | angiopoietin 2 | x | - | - | -
CCL5/RANTES | 6352 | chemokine (C-C motif) ligand 5 | x | - | - | -
CCL7/MCP-3 | 6354 | chemokine (C-C motif) ligand 7 | x | x | - | -
CCL15/MIP-1δ | 6359 | chemokine (C-C motif) ligand 15 | x | x | - | -
CCL18/PARC | 6362 | chemokine (C-C motif) ligand 18 (pulmonary and activation-regulated) | x | - | - | -
CXCL8/IL-8 | 3576 | interleukin 8 | x | - | - | -
EGF | 1950 | epidermal growth factor (beta-urogastrone) | x | x | x | x
G-CSF | 1440 | colony stimulating factor 3 (granulocyte) | x | x | x | x
GDNF | 2668 | glial cell derived neurotrophic factor | x | - | - | -
ICAM-1 | 3383 | intercellular adhesion molecule 1 (CD54), human rhinovirus receptor | x | - | - | -
IGFBP-6 | 3489 | insulin-like growth factor binding protein 6 | x | - | - | -
IL-1α | 3552 | interleukin 1, alpha | x | x | x | x
IL-3 | 3562 | interleukin 3 (colony-stimulating factor, multiple) | x | x | x | x
IL-6 | 3569 | interleukin 6 (interferon, beta 2) | - | x | x | -
IL-11 | 3589 | interleukin 11 | x | x | - | -
M-CSF | 1435 | colony stimulating factor 1 (macrophage) | x | - | - | -
PDGF-BB | 5155 | platelet-derived growth factor beta polypeptide (simian sarcoma viral (v-sis) oncogene homolog) | x | x | - | -
TNF-α | 7124 | tumor necrosis factor (TNF superfamily, member 2) | x | x | x | x
TRAIL R4 | 8793 | tumor necrosis factor receptor superfamily, member 10d, decoy with truncated death domain | x | - | - | -
Tables 5, 6, 7 and 8 show the results of the 24 classifiers for all the signatures considered. The classifiers marked with a star have a random component; therefore the average of ten runs with different seeds is reported. Finally, Tables 9 and 10 summarize the results.
Table 5

Report of the results of the 24 classifiers when using the 18-Protein biomarker.

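Classifier | Grand Total | Overall (“AD”+“MCI”) AD Er. | Overall (“AD”+“MCI”) NAD Er. | Test Set “AD” AD Er. | Test Set “AD” NAD Er. | Test Set “MCI” AD Er. | Test Set “MCI” NAD Er.
Dataset size | 139 | 64 | 75 | 42 | 50 | 22 | 25
PAM | 21 | 7 | 14 | 4 | 6 | 3 | 8
SMO | 20 | 5 | 15 | 2 | 6 | 3 | 9
Simple Logistic | 25 | 10 | 15 | 5 | 6 | 5 | 9
Logistic | 27 | 11 | 16 | 6 | 7 | 5 | 9
Multilayer Perceptron* | 21.7 | 10.1 | 11.6 | 4 | 3.3 | 6.1 | 8.3
Bayes Net | 27 | 7 | 20 | 3 | 7 | 4 | 13
Naïve Bayes | 23 | 4 | 19 | 1 | 5 | 3 | 14
Naïve Bayes Simple | 23 | 4 | 19 | 1 | 5 | 3 | 14
Naïve Bayes Up | 23 | 4 | 19 | 1 | 5 | 3 | 14
IB1 | 21 | 5 | 16 | 2 | 3 | 3 | 13
Ibk | 21 | 5 | 16 | 2 | 3 | 3 | 13
Kstar | 28 | 5 | 23 | 2 | 11 | 3 | 12
LWL | 28 | 15 | 13 | 5 | 3 | 10 | 10
AdaBoost | 23 | 9 | 14 | 4 | 3 | 5 | 11
ClassViaRegression | 28 | 14 | 14 | 5 | 4 | 9 | 10
Decorate* | 23.1 | 7.9 | 15.2 | 3.3 | 5.2 | 4.6 | 10
MultiClass Classifier | 27 | 11 | 16 | 6 | 7 | 5 | 9
Random Committee* | 26.1 | 10.1 | 16 | 4.4 | 5.5 | 5.7 | 10.5
j48 | 24 | 13 | 11 | 3 | 2 | 10 | 9
LMT | 25 | 10 | 15 | 5 | 6 | 5 | 9
NBTree | 26 | 13 | 13 | 5 | 4 | 8 | 9
Part | 25 | 14 | 11 | 7 | 2 | 7 | 9
Random Forest* | 24.3 | 9.3 | 15 | 4.1 | 4 | 5.2 | 11
Ordinal Classifier | 24 | 13 | 11 | 3 | 2 | 10 | 9
Average | 24.34 | 9.02 | 15.33 | 3.66 | 4.79 | 5.36 | 10.53
Agreement (%) | 82% | 86% | 80% | 91% | 90% | 76% | 58%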

18-Protein Signature (Ray et al.)

Table 6

Report of the results of the 24 classifiers when using the 10-Protein biomarker.

10-Protein Signature
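Classifier | Grand Total | Overall (“AD”+“MCI”) AD Er. | Overall (“AD”+“MCI”) NAD Er. | Test Set “AD” AD Er. | Test Set “AD” NAD Er. | Test Set “MCI” AD Er. | Test Set “MCI” NAD Er.
Dataset size | 139 | 64 | 75 | 42 | 50 | 22 | 25
PAM | 23 | 5 | 18 | 3 | 8 | 2 | 10
SMO | 23 | 7 | 16 | 2 | 6 | 5 | 10
Simple Logistic | 23 | 4 | 19 | 1 | 8 | 3 | 11
Logistic | 24 | 6 | 18 | 1 | 9 | 5 | 9
Multilayer Perceptron* | 21.8 | 4.9 | 16.9 | 1.2 | 6.9 | 3.7 | 10
Bayes Net | 28 | 7 | 21 | 1 | 8 | 6 | 13
Naïve Bayes | 31 | 6 | 25 | 2 | 12 | 4 | 13
Naïve Bayes Simple | 31 | 6 | 25 | 2 | 12 | 4 | 13
Naïve Bayes Up | 31 | 6 | 25 | 2 | 12 | 4 | 13
IB1 | 28 | 6 | 22 | 3 | 9 | 3 | 13
Ibk | 28 | 6 | 22 | 3 | 9 | 3 | 13
Kstar | 39 | 3 | 36 | 0 | 18 | 3 | 18
LWL | 28 | 15 | 13 | 5 | 3 | 10 | 10
AdaBoost | 22 | 4 | 18 | 1 | 8 | 3 | 10
ClassViaRegression | 23 | 8 | 15 | 1 | 5 | 7 | 10
Decorate* | 25.1 | 6.7 | 18.4 | 1.6 | 8 | 5.1 | 10.4
MultiClass Classifier | 24 | 6 | 18 | 1 | 9 | 5 | 9
Random Committee* | 25.8 | 9.9 | 15.9 | 3.3 | 6.4 | 6.6 | 9.5
j48 | 22 | 11 | 11 | 3 | 2 | 8 | 9
LMT | 37 | 17 | 20 | 8 | 12 | 9 | 8
NBTree | 19 | 13 | 6 | 5 | 3 | 8 | 3
Part | 21 | 10 | 11 | 3 | 2 | 7 | 9
Random Forest* | 23.9 | 9.4 | 14.5 | 2.7 | 5 | 6.7 | 9.5
Ordinal Classifier | 22 | 11 | 11 | 3 | 2 | 8 | 9
Average | 25.99 | 7.83 | 18.15 | 2.45 | 7.64 | 5.38 | 10.52
Agreement (%) | 81% | 88% | 76% | 94% | 85% | 76% | 58%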
Table 7

Report of the results of the 24 classifiers when using the 6-Protein biomarker.

6-Protein Signature
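Classifier | Grand Total | Overall (“AD”+“MCI”) AD Er. | Overall (“AD”+“MCI”) NAD Er. | Test Set “AD” AD Er. | Test Set “AD” NAD Er. | Test Set “MCI” AD Er. | Test Set “MCI” NAD Er.
Dataset size | 139 | 64 | 75 | 42 | 50 | 22 | 25
PAM | 20 | 8 | 12 | 1 | 3 | 7 | 9
SMO | 20 | 9 | 11 | 2 | 2 | 7 | 9
Simple Logistic | 18 | 4 | 14 | 0 | 4 | 4 | 10
Logistic | 21 | 4 | 17 | 0 | 7 | 4 | 10
Multilayer Perceptron* | 25.6 | 3.2 | 22.4 | 0.4 | 9 | 2.8 | 13.4
Bayes Net | 22 | 8 | 14 | 3 | 4 | 5 | 10
Naïve Bayes | 23 | 8 | 15 | 2 | 5 | 6 | 10
Naïve Bayes Simple | 24 | 9 | 15 | 3 | 5 | 6 | 10
Naïve Bayes Up | 23 | 8 | 15 | 2 | 5 | 6 | 10
IB1 | 33 | 9 | 24 | 3 | 11 | 6 | 13
Ibk | 33 | 9 | 24 | 3 | 11 | 6 | 13
Kstar | 33 | 6 | 27 | 1 | 13 | 5 | 14
LWL | 29 | 16 | 13 | 6 | 3 | 10 | 10
AdaBoost | 27 | 11 | 16 | 3 | 6 | 8 | 10
ClassViaRegression | 23 | 10 | 13 | 3 | 6 | 7 | 7
Decorate* | 24.7 | 9.8 | 14.9 | 2.4 | 4.8 | 7.4 | 10.1
MultiClass Classifier | 21 | 4 | 17 | 0 | 7 | 4 | 10
Random Committee* | 26.6 | 11.5 | 15.1 | 3.1 | 5.6 | 8.4 | 9.5
j48 | 24 | 10 | 14 | 2 | 5 | 8 | 9
LMT | 18 | 4 | 14 | 0 | 4 | 4 | 10
NBTree | 21 | 10 | 11 | 1 | 2 | 9 | 9
Part | 27 | 13 | 14 | 3 | 5 | 10 | 9
Random Forest* | 25.6 | 11.8 | 13.8 | 2.6 | 4.4 | 9.2 | 9.4
Ordinal Classifier | 24 | 10 | 14 | 2 | 5 | 8 | 9
Average | 24.44 | 8.60 | 15.84 | 2.02 | 5.70 | 6.58 | 10.14
Agreement (%) | 82% | 87% | 79% | 95% | 89% | 70% | 59%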

With this biomarker, the effectiveness of predicting AD on the “AD” test set is notable when using simple classifiers such as Simple Logistic or LMT (Logistic Model Tree), or even the same classifier used in [1] (PAM).

Table 8

Report of the results of the 24 classifiers when using the 5-Protein biomarker.

5-Protein Signature
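Classifier | Grand Total | Overall (“AD”+“MCI”) AD Er. | Overall (“AD”+“MCI”) NAD Er. | Test Set “AD” AD Er. | Test Set “AD” NAD Er. | Test Set “MCI” AD Er. | Test Set “MCI” NAD Er.
Dataset size | 139 | 64 | 75 | 42 | 50 | 22 | 25
PAM | 21 | 10 | 11 | 3 | 2 | 7 | 9
SMO | 19 | 8 | 11 | 2 | 2 | 6 | 9
Simple Logistic | 18 | 4 | 14 | 0 | 4 | 4 | 10
Logistic | 20 | 4 | 16 | 0 | 6 | 4 | 10
Multilayer Perceptron* | 21.6 | 5.3 | 16.3 | 0.7 | 5.2 | 4.6 | 11.1
Bayes Net | 21 | 4 | 17 | 1 | 5 | 3 | 12
Naïve Bayes | 19 | 5 | 14 | 1 | 2 | 4 | 12
Naïve Bayes Simple | 20 | 5 | 15 | 1 | 3 | 4 | 12
Naïve Bayes Up | 19 | 5 | 14 | 1 | 2 | 4 | 12
IB1 | 30 | 10 | 20 | 3 | 7 | 7 | 13
Ibk | 30 | 10 | 20 | 3 | 7 | 7 | 13
Kstar | 26 | 8 | 18 | 3 | 7 | 5 | 11
LWL | 29 | 16 | 13 | 6 | 3 | 10 | 10
AdaBoost | 31 | 3 | 28 | 1 | 11 | 2 | 17
ClassViaRegression | 24 | 5 | 19 | 1 | 7 | 4 | 12
Decorate* | 21.8 | 8.7 | 13.1 | 1.7 | 3.9 | 7 | 9.2
MultiClass Classifier | 20 | 4 | 16 | 0 | 6 | 4 | 10
Random Committee* | 26.1 | 10.9 | 15.2 | 3.1 | 5.1 | 7.8 | 10.1
j48 | 24 | 10 | 14 | 2 | 5 | 8 | 9
LMT | 18 | 4 | 14 | 0 | 4 | 4 | 10
NBTree | 21 | 10 | 11 | 1 | 2 | 9 | 9
Part | 27 | 13 | 14 | 3 | 5 | 10 | 9
Random Forest* | 26.2 | 12.1 | 14.1 | 3.2 | 4.9 | 8.9 | 9.2
Ordinal Classifier | 24 | 10 | 14 | 2 | 5 | 8 | 9
Average | 23.20 | 7.71 | 15.49 | 1.78 | 4.75 | 5.93 | 10.73
Agreement (%) | 83% | 88% | 79.4% | 96% | 90% | 73% | 57%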

Removing IL-6 from the biomarker set gives a small gain in predicting AD in both data sets, compared to the 6-protein signature. In this case, the prediction of AD on the “AD” test set achieves an average of 96% without dropping the accuracy of the prediction of NonAD.

Table 9

Average results for each signature over 24 classifiers.

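Size | Row | Overall | Overall (“AD”+“MCI”) AD Er. | Overall (“AD”+“MCI”) NAD Er. | Test set “AD” AD Er. | Test set “AD” NAD Er. | Test set “MCI” AD Er. | Test set “MCI” NAD Er.
Dataset size | | 139 | 64 | 75 | 42 | 50 | 22 | 25
18 protein Sig. | Error Avg | 24.34 | 9.02 | 15.33 | 3.66 | 4.79 | 5.36 | 10.53
18 protein Sig. | Agr % | 82% | 86% | 80% | 91% | 90% | 76% | 58%
18 protein Sig. | Agr % per test set | | 82% (both classes) | | 91% (both classes) | | 66% (both classes) |
10 protein Sig. | Error Avg | 25.98 | 7.83 | 18.15 | 2.45 | 7.64 | 5.38 | 10.52
10 protein Sig. | Agr % | 81% | 88% | 76% | 94% | 85% | 76% | 58%
10 protein Sig. | Agr % per test set | | 81% (both classes) | | 89% (both classes) | | 66% (both classes) |
6 protein Sig. | Error Avg | 24.44 | 8.60 | 15.84 | 2.02 | 5.70 | 6.58 | 10.14
6 protein Sig. | Agr % | 82% | 87% | 79% | 95% | 89% | 70% | 59%
6 protein Sig. | Agr % per test set | | 82% (both classes) | | 92% (both classes) | | 64% (both classes) |
5 protein Sig. | Error Avg | 23.20 | 7.71 | 15.49 | 1.78 | 4.75 | 5.93 | 10.73
5 protein Sig. | Agr % | 83% | 88% | 79% | 96% | 90% | 73% | 57%
5 protein Sig. | Agr % per test set | | 83% (both classes) | | 93% (both classes) | | 65% (both classes) |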

For each signature the average number of errors is reported and the percentage agreement is calculated over each specific population. The best results are highlighted in bold text.
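The percentage agreement in Table 9 follows directly from the average number of errors and the size of each population. The sketch below recomputes the row of the 5-protein signature; the dictionary keys and the helper function are our own naming, while the population sizes and error averages are those listed in Table 9.

```python
# Population sizes from the "Dataset size" row of Table 9.
SIZES = {"AD/AD": 42, "AD/NAD": 50, "MCI/AD": 22, "MCI/NAD": 25}

# Average errors of the 5-protein signature over the 24 classifiers (Table 9).
ERRORS_5 = {"AD/AD": 1.78, "AD/NAD": 4.75, "MCI/AD": 5.93, "MCI/NAD": 10.73}

def agreement(avg_errors: float, population: int) -> float:
    """Percentage of samples of a population that are classified correctly."""
    return 100.0 * (1.0 - avg_errors / population)

# Agreement for each class within each test set.
per_class = {key: agreement(ERRORS_5[key], SIZES[key]) for key in SIZES}

# Agreement for each test set as a whole (both classes pooled).
ad_set = agreement(ERRORS_5["AD/AD"] + ERRORS_5["AD/NAD"], SIZES["AD/AD"] + SIZES["AD/NAD"])
mci_set = agreement(ERRORS_5["MCI/AD"] + ERRORS_5["MCI/NAD"], SIZES["MCI/AD"] + SIZES["MCI/NAD"])

for key, value in per_class.items():
    print(f"{key}: {value:.1f}%")
print(f"AD test set: {ad_set:.1f}%  MCI test set: {mci_set:.1f}%")
```

Up to rounding, the results reproduce the 96%, 90%, 73% and 57% per-class figures and the 93% and 65% per-test-set figures reported for the 5-protein signature.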

Table 10

The standard deviation of each test is shown on this table.

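Signature | Overall (“AD”+“MCI”) AD Er. | Overall (“AD”+“MCI”) NAD Er. | Test set AD, AD Er. | Test set AD, NAD Er. | Test set MCI, AD Er. | Test set MCI, NAD Er.
18 protein Sig. | 3.580 | 3.022 | 1.692 | 2.087 | 2.430 | 1.982
10 protein Sig. | 3.546 | 6.127 | 1.721 | 3.893 | 2.214 | 2.729
6 protein Sig. | 3.165 | 4.218 | 1.419 | 2.798 | 2.024 | 1.625
5 protein Sig. | 3.520 | 3.668 | 1.433 | 2.175 | 2.326 | 1.906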

All the signatures show a very similar behaviour with a small standard deviation.

The results of our 5-protein signature are reported in Table 8. On the “AD” test set, the 5-protein signature obtains, on average over the 24 classifiers, 96% when predicting AD and 90% when predicting non-demented control. It is also worth mentioning that there are four different classifiers achieving almost 100% accuracy (i.e. having a number of errors smaller than or equal to 1) for predicting AD on the “AD” test set. These results are achieved without losing accuracy when predicting non-demented controls on the same dataset. One feature of the experiments in Table 9 is worth commenting on: all the signatures drop at least 30% in accuracy when considering the “MCI” dataset. This is understandable since the classifiers have no sample labelled “MCI” in the training set. The best overall result, considering both test sets, is obtained by the 6-protein and 5-protein signatures. They present 18 errors, and in both signatures this result is obtained twice, when using the LMT and Simple Logistic classifiers (Tables 7 and 8). In Table 10, the standard deviations of the number of errors are almost constant for all signatures, in all datasets.
This reinforces our previous claim: the poor performance of the signatures on the “MCI” dataset is related to the fact that the signatures were not trained to distinguish between AD and MCI. To present the experimental results in another form, we compared the performance of each signature in each test. Table 11 presents the comparison between the signatures when considering all the test sets (“AD”+“MCI”), totalling 139 samples. It is remarkable that the 5-protein signature not only has a better average performance, but also presents the best result on 16 of the 24 algorithms used for classification (the number of errors highlighted in bold text indicates the best performance for this particular classifier).
Table 11

Number of errors for each classifier when considering both test sets together (139 samples).

Method | Overall errors: 18 | 10 | 6 | 5
Simple Logistic | 25 | 25 | 18 | 18
LMT | 25 | 25 | 18 | 18
Logistic | 27 | 24 | 21 | 20
MultiClass Classifier | 27 | 24 | 21 | 20
Bayes Net | 27 | 28 | 22 | 21
NBTree | 26 | 23 | 21 | 21
Naïve Bayes | 23 | 30 | 23 | 19
Naïve Bayes Up. | 23 | 30 | 23 | 19
ClassViaRegression | 28 | 25 | 23 | 24
Naïve Bayes Simple | 23 | 30 | 24 | 20
Kstar | 28 | 41 | 33 | 26
Decorate | 23.1 | 28.3 | 24.7 | 21.8
SMO | 20 | 23 | 20 | 19
Multilayer Perceptron | 21.7 | 21.8 | 25.6 | 21.6
PAM | 21 | 22 | 20 | 21
Random Committee | 26.1 | 26.3 | 26.6 | 26.1
j48 | 24 | 24 | 24 | 24
Ordinal Class Classifier | 24 | 24 | 24 | 24
LWL | 28 | 28 | 29 | 29
Random Forest | 24.3 | 24.3 | 25.6 | 26.2
Part | 25 | 30 | 27 | 27
AdaBoost | 23 | 31 | 27 | 31
IB1 | 21 | 28 | 33 | 30
Ibk | 21 | 28 | 33 | 30
Average | 24.342 | 26.821 | 24.438 | 23.196
Agreement % | 82% | 81% | 82% | 83%

The signature with the best performance on each classifier is highlighted in bold text.

In Table 12, the same comparison is made but only considering the “AD” test set. Once again, the performance of the 5-protein signature stands out: it obtains not only the best average result but also the best individual results, presenting 3 errors on 3 occasions.
Table 12

Number of errors for each classifier when considering the “AD” test set (92 samples).

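Method | “AD” test set: 18 | 10 | 6 | 5
NBTree | 9 | 8 | 3 | 3
Simple Logistic | 11 | 9 | 4 | 4
LMT | 11 | 20 | 4 | 4
Logistic | 13 | 10 | 7 | 6
MultiClass Classifier | 13 | 10 | 7 | 6
PAM | 10 | 11 | 4 | 5
SMO | 8 | 8 | 4 | 4
Naïve Bayes | 6 | 14 | 7 | 3
Naïve Bayes Up. | 6 | 14 | 7 | 3
Bayes Net | 10 | 9 | 7 | 6
Decorate | 8.5 | 9.6 | 7.2 | 5.6
Naïve Bayes Simple | 6 | 14 | 8 | 4
Kstar | 13 | 18 | 14 | 10
Multilayer Perceptron | 7.3 | 8.1 | 9.4 | 5.9
Random Committee | 9.9 | 9.7 | 8.7 | 8.2
ClassViaRegression | 9 | 6 | 9 | 8
Part | 9 | 5 | 8 | 8
Random Forest | 8.1 | 7.7 | 7 | 8.1
LWL | 8 | 8 | 9 | 9
j48 | 5 | 5 | 7 | 7
Ordinal Class Classifier | 5 | 5 | 7 | 7
AdaBoost | 7 | 9 | 9 | 12
IB1 | 5 | 12 | 14 | 10
Ibk | 5 | 12 | 14 | 10
Average | 8.45 | 10.09 | 7.72 | 6.53
Agreement % | 91% | 89% | 92% | 93%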

The signature with the best performance on each classifier is highlighted in bold text.

Finally, Table 13 presents the same analysis for the “MCI” test set. In this case the most remarkable observation is the lack of quality in predicting MCI-AD. The improved performance of the largest signatures is related to the fact that they contain more proteins: because the signatures were not trained to distinguish between MCI patients, the use of more proteins allows a slightly better performance. Nevertheless, even the best signature for this case (a 10-protein signature) presents a poor performance when compared with the previous results.
Table 13

Number of errors for each classifier when considering the “MCI” test set (47 samples).

Method | “MCI” test set: 18 | 10 | 6 | 5
ClassViaRegression | 19 | 17 | 14 | 16
Bayes Net | 17 | 19 | 15 | 15
j48 | 19 | 17 | 17 | 17
Ordinal Class Classifier | 19 | 17 | 17 | 17
Naïve Bayes | 17 | 17 | 16 | 16
Naïve Bayes Simple | 17 | 17 | 16 | 16
Naïve Bayes Up. | 17 | 17 | 16 | 16
Simple Logistic | 14 | 14 | 14 | 14
Logistic | 14 | 14 | 14 | 14
LWL | 20 | 20 | 20 | 20
MultiClass Classifier | 14 | 14 | 14 | 14
LMT | 14 | 17 | 14 | 14
NBTree | 17 | 11 | 18 | 18
Kstar | 15 | 21 | 19 | 16
Multilayer Perceptron | 14.4 | 13.7 | 16.2 | 15.7
Random Committee | 16.2 | 16.1 | 17.9 | 17.9
Decorate | 14.6 | 15.5 | 17.5 | 16.2
Random Forest | 16.2 | 16.2 | 18.6 | 18.1
AdaBoost | 16 | 13 | 18 | 19
Part | 16 | 16 | 19 | 19
IB1 | 16 | 16 | 19 | 20
Ibk | 16 | 16 | 19 | 20
SMO | 12 | 15 | 16 | 15
PAM | 11 | 12 | 16 | 16
Average | 16.29 | 16.11 | 16.78 | 16.77
Agreement % | 65% | 66% | 64% | 64%

The signature with the best performance on each classifier is highlighted in bold text.


Discussion

In conclusion, it is clear that the experiment performed by Ray et al. provided an extremely useful dataset for the identification of Alzheimer's disease biomarkers. Using their data, we have uncovered a robust 5-protein signature with nearly 97% accuracy in predicting AD against non-demented controls. Our signature has fewer than one third of the proteins of the one proposed in the original paper, and at least the same level of prediction performance. The next step in this important quest is to set up an independent experimental procedure that considers samples with mild cognitive impairment (but without AD) in the training set. This has not been done and warrants further investigation. We do not agree with the methodology of using a training set without MCI to select biomarkers to differentiate between AD and MCI [1]; only by including such samples can we uncover useful biomarkers to discriminate between AD and MCI. On the positive side, our methods reveal the true predictive potential of testing for Alzheimer's disease using this panel of signalling proteins. It is clear that Alzheimer researchers can benefit directly from our identification of more robust biomarkers. The method has proved to be useful, simple yet very powerful, and warrants application in other settings, including other multifactorial diseases.

Methods

Our methodology consisted of the application of an integrative data analysis method with four steps: a) abundance quantization, b) feature selection, c) literature analysis, and d) selection of a classifier algorithm that is independent of the feature selection process. These steps were performed without using any of the test datasets. For the first two steps, we applied Fayyad and Irani's discretization algorithm [7] for selection and quantization, which in turn creates an instance of the (alpha-beta)-k-Feature Set problem [8]–[10]. Fayyad and Irani's method retained only 14 of the 120 proteins of the training set (i.e. those proteins for which no threshold was selected were filtered out). After quantization, samples 7, 43 (AD, “Alzheimer's Disease”) and 48 (NDC, “Nondemented Control”) of the training set were “in conflict”, which means that they have the same quantized values for all 14 selected proteins although they belong to different classes. These conflicts were then removed, i.e. the three samples were eliminated from the training set and we applied our algorithms to the remaining 80 samples. Numerical solution of the (alpha-beta)-k-Feature Set problem led to the selection of only 10 proteins (Table 4). For a detailed explanation of the methods and other applications, readers can check our referenced publications and references therein [11]–[13]. To guarantee the reproducibility of all our experiments, we used algorithms from the Weka package [3] as classifiers. All the classifiers were used with their default parameters; we are convinced that better results could be found if adjustments were made to each classifier (considering only its results over the training set). The first signature we uncovered contains 10 proteins (see Table 4). Using the Pathway Studio software [3], we generated an undirected graph of the known ‘direct relations’ of these 10 proteins.
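Returning to the quantization step described above, the following sketch illustrates the entropy-based cut-point test of Fayyad and Irani for a single feature: the best binary cut is found by information gain and is accepted only if it passes the minimum description length (MDL) criterion. It is a simplified illustration under stated assumptions, not the implementation used in the study: it evaluates one cut only (the full algorithm recurses on both sides of an accepted cut), the function names are our own, and the abundance values in the usage example are invented toy numbers.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_mdl_cut(values, labels):
    """Return the best threshold if it passes the MDL criterion, otherwise None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    all_labels = [lab for _, lab in pairs]
    ent_s = entropy(all_labels)
    best = None  # (gain, threshold, left_labels, right_labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a cut cannot separate two equal abundance values
        left, right = all_labels[:i], all_labels[i:]
        gain = ent_s - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2, left, right)
    if best is None:
        return None
    gain, threshold, left, right = best
    # MDL stopping criterion: number of classes in the whole set and in each side.
    k, k1, k2 = len(set(all_labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * entropy(left) - k2 * entropy(right))
    return threshold if gain > (math.log2(n - 1) + delta) / n else None

# Toy data (hypothetical): one informative and one uninformative feature.
classes = ["AD"] * 5 + ["NDC"] * 5
informative = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5]
cut_informative = best_mdl_cut(informative, classes)
cut_uninformative = best_mdl_cut(list(range(10)), ["AD", "NDC"] * 5)
print(cut_informative, cut_uninformative)
```

A feature for which no threshold is accepted (the function returns None) corresponds to a protein that is filtered out, which is how the discretization doubles as a first feature selection step.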
Each node in the graph corresponds to a protein, and an edge exists if the Pathway Studio software produced a ‘direct relation’, indicating an important association already observed in the life sciences literature. On this graph we looked for its maximum clique (Fig. 3a). We denote this graph as G = (V,E). Each vertex in V has a one-to-one correspondence with a protein. A pair of vertices is connected by an edge in E if and only if there are many direct relations between the two proteins reported in the literature. A clique in G is a subset X of V such that its induced graph G[X] is complete. In other words, we are looking for the maximum subset of proteins in which all pairs of proteins already have a direct relationship identified between them; we consider this set the core of our 10-protein signature (this core consists of the six proteins listed above, see Fig. 3b).
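For a graph with only 10 vertices, the maximum clique can be found by exhaustive search. The sketch below is one simple way of doing so and is not necessarily the procedure used in the study; the vertex names P1 to P5 and the edge list are hypothetical placeholders, not the actual relations returned by Pathway Studio.

```python
from itertools import combinations

def maximum_clique(vertices, edges):
    """Return a largest subset of vertices in which every pair is joined by an edge."""
    adjacent = {frozenset(edge) for edge in edges}
    # Try subsets from the largest size downwards; the first complete one is maximum.
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(frozenset(pair) in adjacent for pair in combinations(subset, 2)):
                return set(subset)
    return set()

# Hypothetical toy graph: P1, P2 and P3 are pairwise related; P4 and P5 hang off P3.
toy_vertices = ["P1", "P2", "P3", "P4", "P5"]
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]

core = maximum_clique(toy_vertices, toy_edges)
print(sorted(core))
```

Exhaustive search examines at most 2^10 subsets for a 10-protein graph, so no specialised clique algorithm is needed at this scale.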
Figure 3

Classification and prediction of clinical Alzheimer's diagnosis in subjects with Alzheimer's disease.

(a) An undirected graph in which each node corresponds to a different protein belonging to the 10-protein signature we identified; each edge indicates the existence of a direct relation obtained by searching the PubMed database (using the Pathway Studio software). (b) Identification of the maximum clique of the graph, uncovering a robust 6-protein signature; each node of the clique has a direct relation with every other node. Simple Logistic was used to classify and predict the Alzheimer's (AD) and non-Alzheimer's classes in the training set (c) and the blinded test set ‘AD’ (d). All results are shown as confusion matrices; for the training set, 10-fold cross-validation was applied 10 times, and in both cases Simple Logistic was used with the default parameters of the Weka package. All p-values were calculated using the Fisher exact test.

Our first benchmark test for this 6-protein signature used Simple Logistic (SL) [14], perhaps the simplest classifier in the Weka software suite. With our 6-protein signature, SL had a performance of 86% over 10 repetitions of 10-fold cross-validation on the training set (Fig. 3c). On the “AD” test set, our 6-protein signature with SL classified the samples with 97% accuracy: for AD samples we achieved 100% positive agreement and for NDC samples 92% negative agreement (Fig. 3d). On the second test set (labelled “MCI”), which includes samples that had an initial diagnosis of mild cognitive impairment, all signatures make more errors. It is reasonable to expect some degradation in the performance of our very trimmed classifiers, as they have not been trained to distinguish confirmed AD samples from those that have MCI; more errors are therefore an expected outcome when the same signature is used to differentiate between AD and other samples of MCI patients (Table 9).
In spite of this fact, the overall performance of all signatures seems very robust.
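The quantities reported above (accuracy, positive agreement, negative agreement and a Fisher exact test p-value) can all be derived from a 2×2 confusion matrix. The Python sketch below shows one way to do so under stated assumptions: the function names are hypothetical, the counts are invented for illustration and are not the paper's results, and the two-sided p-value is computed by summing the hypergeometric probabilities of all tables no more likely than the observed one.

```python
from math import comb


def agreement_metrics(tp, fn, fp, tn):
    """Accuracy, positive agreement (AD recovered as AD) and
    negative agreement (NDC recovered as NDC) from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "positive_agreement": tp / (tp + fn),
        "negative_agreement": tn / (tn + fp),
    }


def fisher_exact_two_sided(tp, fn, fp, tn):
    """Two-sided Fisher exact test p-value for a 2x2 table."""
    row1, row2 = tp + fn, fp + tn
    col1 = tp + fp
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(tp)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


# Hypothetical counts for illustration only (not taken from the paper).
tp, fn, fp, tn = 9, 1, 2, 8
metrics = agreement_metrics(tp, fn, fp, tn)
p_value = fisher_exact_two_sided(tp, fn, fp, tn)
print(metrics)
print(p_value)
```

A small p-value indicates that the agreement between predicted and clinical classes is unlikely to arise by chance given the table margins.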