Osamu Komori, Mari Pritchard, Shinto Eguchi.
Abstract
This paper discusses mathematical and statistical aspects of the analysis methods applied to microarray gene expression data. We focus on pattern recognition for extracting informative features embedded in the data in order to predict phenotypes. It has been pointed out that severe difficulties arise from the imbalance between the number of observed genes and the number of observed subjects. We reanalyze published microarray gene expression data and detect many other gene sets with almost the same performance. We conclude that, at the current stage, it is not possible to extract only the informative genes with high performance from among all observed genes. We investigate why this difficulty persists even though analysis methods and learning algorithms are being actively proposed in statistical machine learning. We focus on the mutual coherence, that is, the absolute value of the Pearson correlation between two genes, and describe the distributions of the correlation for the selected set of genes and for the total set. We show that the problem of finding informative genes in high-dimensional data is ill-posed and that the difficulty is closely related to the mutual coherence.
Year: 2013 PMID: 23662163 PMCID: PMC3639638 DOI: 10.1155/2013/798189
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
A taxonomy of feature selection techniques as summarized by Saeys et al. [8]. The major types of feature selection (filter, wrapper, and embedded) are addressed, and each type has a subcategory. Advantages, disadvantages, and example methods are shown.
| Model search | Advantages | Disadvantages | Examples |
|---|---|---|---|
| Filter | Fast | Ignores feature dependencies | |
| | Models feature dependencies | Slower than univariate techniques | Correlation-based feature selection |
| Wrapper | Simple | Risk of overfitting | Sequential forward selection |
| | Less prone to local optima | Computationally intensive | Simulated annealing |
| Embedded | Interacts with the classifier | Classifier dependent selection | Decision trees |
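
As a rough illustration of the fast, univariate end of the filter category in the taxonomy, the sketch below scores each feature on its own by AUC and keeps the top k. It is not the procedure used in the paper: the function names `auc_per_feature` and `univariate_filter`, the synthetic data, and the choice of AUC as the score are all assumptions made for the example. Because each feature is scored in isolation, dependencies between features are ignored, which is the disadvantage the table lists for this kind of filter.

```python
import numpy as np

def auc_per_feature(X, y):
    """AUC of each column of X for separating y == 1 from y == 0.

    Uses the Mann-Whitney rank formulation. Ties are assumed absent,
    which holds for continuous data such as the synthetic matrix below.
    """
    n_pos = int(np.sum(y == 1))
    n_neg = int(np.sum(y == 0))
    # Rank every column separately (1 = smallest value in that column).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    rank_sum_pos = ranks[y == 1].sum(axis=0)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def univariate_filter(X, y, k):
    """Indices of the k features whose AUC is farthest from 0.5."""
    score = np.abs(auc_per_feature(X, y) - 0.5)
    return np.argsort(score)[::-1][:k]

# Synthetic data (an assumption for illustration, not the paper's data):
# many more features than samples, with signal in the first five only.
rng = np.random.default_rng(0)
n_samples, n_features = 60, 500
y = np.repeat([0, 1], n_samples // 2)
X = rng.normal(size=(n_samples, n_features))
X[y == 1, :5] += 3.0

selected = univariate_filter(X, y, k=5)
```

The signal here is deliberately strong so that the five informative features are recovered. With weaker signal and far more features than samples, noise features can reach similar scores by chance, which is one way to see the instability of gene selection that the abstract describes.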
Figure 1. Results of classical methods. (a) and (b) show the AUC values (left panel) and the error rates (right panel) over data sets 𝒟 for the van't Veer method and DLDA, respectively.
Figure 2. Results of boosting methods. (a) and (b) show the AUC values (left panel) and the error rates (right panel) over data sets 𝒟 for AdaBoost and AUCBoost, respectively.
Figure 3. Heat maps of gene expression data, with rows representing genes and columns representing patients. (a) D 1–70 (MammaPrint), (b) D 11–80, clearly showing some subtypes, and (c) D 111–180, with the highest BHI regarding metastases. The blue bars indicate patients with metastases, the red bars those who are ER positive, and the orange bars those who are PR positive.
Figure 4. Biological homogeneity index (BHI) for metastases (solid line), ER positive (dashed line), and PR positive (dotted line).
Figure 5. Rates at which genes rank in the top 70 by correlation coefficient (a) and by AUC (b), based on 50 randomly sampled patients. The horizontal axis denotes the 70 genes used by MammaPrint.
Figure 6. The values of AUC (left panel) and error rate (right panel) calculated by DLDA using the genes ranked in the top 70 by correlation coefficient (a) and by AUC (b). These values are calculated from 50 randomly sampled patients over 100 trials.
Figure 7. (a) Scatter plots of the two pairs of genes with the highest mutual coherence (0.984) among the 70 genes of MammaPrint. (b) The distribution of the correlations for all 5420 genes (black) and for the 70 genes of MammaPrint (red). The horizontal axis denotes the index of the gene pairs for which the correlations are calculated, standardized to lie between 0 and 1 for clarity.
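
The mutual coherence described in the abstract, the absolute Pearson correlation between two genes, can be sketched in a few lines. The example below is a minimal illustration on synthetic data, not a reproduction of the paper's analysis: the function names, the matrix sizes, the planted near-duplicate gene, and the decision to sort the correlations before placing them on a [0, 1] index are all assumptions.

```python
import numpy as np

def abs_pairwise_correlations(X):
    """Absolute Pearson correlations for all distinct pairs of columns of X."""
    corr = np.corrcoef(X, rowvar=False)        # columns are genes
    upper = np.triu_indices_from(corr, k=1)    # each unordered pair once
    return np.abs(corr[upper])

def mutual_coherence(X):
    """Largest absolute correlation between any two distinct columns."""
    return float(abs_pairwise_correlations(X).max())

# Synthetic expression matrix (hypothetical sizes): rows are patients,
# columns are genes. Gene 1 is made a noisy copy of gene 0.
rng = np.random.default_rng(1)
n_patients, n_genes = 50, 200
X = rng.normal(size=(n_patients, n_genes))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n_patients)

subset = [0, 1, 2, 3, 4]   # stand-in for a selected gene set

# Sorted correlations for the full set and for the subset, each placed
# on an index rescaled to [0, 1] so the two curves can be compared.
all_sorted = np.sort(abs_pairwise_correlations(X))[::-1]
sub_sorted = np.sort(abs_pairwise_correlations(X[:, subset]))[::-1]
all_index = np.linspace(0.0, 1.0, all_sorted.size)
sub_index = np.linspace(0.0, 1.0, sub_sorted.size)

coherence_all = mutual_coherence(X)
coherence_sub = mutual_coherence(X[:, subset])
```

Rescaling the pair index to [0, 1] matters because the number of pairs grows quadratically with the number of genes (19,900 pairs for 200 genes against 10 pairs for a subset of 5), so the raw indices of the two sets are not directly comparable.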