Y-h Taguchi, Mitsuo Iwadate, Hideaki Umeyama.
Abstract
BACKGROUND: Feature extraction (FE) is difficult, particularly when there are more features than samples, because small sample numbers often result in biased outcomes or overfitting. Furthermore, multiple sample classes often complicate FE because evaluating performance, which is usual in supervised FE, is generally harder than in the two-class problem. Developing unsupervised methods that are independent of sample classification would solve many of these problems.
Year: 2015 PMID: 25925353 PMCID: PMC4448281 DOI: 10.1186/s12859-015-0574-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Boxplot of typical features with distinct values between four classes (1≤i≤5) in the simulated data set; s=2 (easy), 1 (medium), and 0.5 (hard).
Figure 2. Matthews correlation coefficients and F measure for various FE methods applied to the simulated data set. t test, one vs one t test based FE; CRP, categorical regression based FE (using adjusted P-values); CRR, categorical regression based FE (using ranked P-values); BAHSIC, backward elimination using Hilbert-Schmidt norm of the cross-covariance operator; VBPCAFE, variational Bayes principal component analysis based unsupervised FE; CPCAFE, conventional principal component analysis based unsupervised FE.
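As a rough illustration of the two scores named in the Figure 2 caption, the sketch below computes the Matthews correlation coefficient and the F measure from a list of truly distinct features and a list of selected features. The function names and the toy counts (100 features, 10 truly distinct, 8 of them recovered) are assumptions made for this example and are not taken from the study.

```python
import math


def confusion_counts(truth, selected):
    """Return (TP, FP, FN, TN) for a feature-selection result.

    truth[i] is True if feature i really differs between classes;
    selected[i] is True if the FE method picked feature i.
    """
    tp = sum(t and s for t, s in zip(truth, selected))
    fp = sum((not t) and s for t, s in zip(truth, selected))
    fn = sum(t and (not s) for t, s in zip(truth, selected))
    tn = sum((not t) and (not s) for t, s in zip(truth, selected))
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f_measure(tp, fp, fn):
    """F measure (harmonic mean of precision and recall)."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical toy case: features 0-9 are truly distinct,
# and the method selects features 0-7 plus two false positives.
truth = [True] * 10 + [False] * 90
selected = [True] * 8 + [False] * 2 + [True] * 2 + [False] * 88

tp, fp, fn, tn = confusion_counts(truth, selected)
print(round(mcc(tp, fp, fn, tn), 3), round(f_measure(tp, fp, fn), 3))
```

Both scores equal 1 for a perfect selection. Unlike the F measure, the Matthews correlation coefficient also uses the true negatives, which matters when most features are not distinct.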
Figure 3. Relationship between and features with distinct expression among four classes. Left column: histogram obtained from logarithmic for 100 independent ensembles. Red color indicates features with distinct expression between four classes. Right column: proportion of features with distinct expression between four classes in each bin. Top: s=2, easy; middle: s=1, medium; bottom: s=0.5, hard cases.
Figure 4. A boxplot with distinct values between four classes in 100 independent ensembles of the simulated data set; s=2 (easy), 1 (medium), and 0.5 (hard).
Table 1. Artificial partial mislabeling introduced to the simulated data set to check robustness of FE
| True label | (a) 1 | (a) 2 | (a) 3 | (a) 4 | (b) 1 | (b) 2 | (b) 3 | (b) 4 | (c) 1 | (c) 2 | (c) 3 | (c) 4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 1 | 0 | 0 | 4 | 0 | 1 | 0 | 4 | 0 | 0 | 1 |
| 2 | 1 | 4 | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 4 | 1 | 0 |
| 3 | 0 | 0 | 4 | 1 | 1 | 0 | 4 | 0 | 0 | 1 | 4 | 0 |
| 4 | 0 | 0 | 1 | 4 | 0 | 1 | 0 | 4 | 1 | 0 | 0 | 4 |
Rows and columns represent true and modified labels. Numbers represent the number of samples with correct and modified labels. (a) little; (b) medium; and (c) heavy mislabeling. Correlation coefficients represent the amount of mislabeling (larger correlation coefficients correspond to less mislabeling).
Figure 5. Matthews correlation coefficients and F measure for various FE methods applied to the simulated data set with partial mislabeling, indicated in Table 1. Correlation coefficients between mislabeled and true labeling were (a) 0.92, (b) 0.68, and (c) 0.60, as shown in Table 1. Other notations are the same as in Figure 2.
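The correlation coefficients quoted in this caption can be checked against the counts in Table 1. The sketch below assumes that class labels are treated as the integers 1 to 4 and that the coefficient is Pearson's correlation between true and modified labels. Neither assumption is stated in the captions, but together they reproduce the reported 0.92, 0.68, and 0.60.

```python
import math

# Counts copied from Table 1: rows are true labels 1-4,
# columns are modified labels 1-4, one matrix per mislabeling level.
MISLABELING = {
    "a": [[4, 1, 0, 0], [1, 4, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]],
    "b": [[4, 0, 1, 0], [0, 4, 0, 1], [1, 0, 4, 0], [0, 1, 0, 4]],
    "c": [[4, 0, 0, 1], [0, 4, 1, 0], [0, 1, 4, 0], [1, 0, 0, 4]],
}


def expand(matrix):
    """Turn a count matrix into per-sample (true, modified) label lists."""
    true_labels, modified_labels = [], []
    for i, row in enumerate(matrix, start=1):
        for j, count in enumerate(row, start=1):
            true_labels += [i] * count
            modified_labels += [j] * count
    return true_labels, modified_labels


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


for level, matrix in MISLABELING.items():
    print(level, round(pearson(*expand(matrix)), 2))
```

Each class has five samples, one of which is relabeled. The levels differ only in how far the wrong label lies from the true one, which is why the correlation falls from (a) to (c).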
Figure 6. Two-dimensional embedding of miRNA expression determined by PCA. Each dot represents a probe. Red dots indicate the 100 top-ranked probes with larger PC1 scores, selected as outliers based on the criterion of CPCAFE. A few probes with relatively large PC2 scores were excluded to allow selection of PC1 enhanced probes. See Additional file 2 for more detail.
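The selection step described in this caption can be sketched as follows: embed the probes with PCA, then keep the probes with the largest PC1 scores. This is a minimal illustration, not the authors' implementation. The function names, the column centering, the use of absolute PC1 scores (the sign of a principal component is arbitrary), and the toy data are all assumptions, and the PC2-based exclusion mentioned in the caption is not implemented.

```python
import math
import random


def first_pc_scores(data, n_iter=200, seed=0):
    """PC1 score of each probe; data[i][j] is probe i in sample j.

    The leading eigenvector of the sample-by-sample scatter matrix is
    found by power iteration, and each centered probe is projected on it.
    """
    n_probes, n_samples = len(data), len(data[0])
    # Center each sample (column) across probes: an assumed preprocessing step.
    means = [sum(row[j] for row in data) / n_probes for j in range(n_samples)]
    centered = [[row[j] - means[j] for j in range(n_samples)] for row in data]
    scatter = [[sum(r[a] * r[b] for r in centered) for b in range(n_samples)]
               for a in range(n_samples)]
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n_samples)]
    for _ in range(n_iter):
        w = [sum(scatter[a][b] * v[b] for b in range(n_samples))
             for a in range(n_samples)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("data has no variance")
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(n_samples)) for r in centered]


def select_top(scores, k):
    """Indices of the k probes with the largest absolute scores, sorted."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(order[:k])


# Hypothetical toy data: 50 probes x 6 samples of small noise, with
# probes 0-4 raised in the first three samples and lowered in the last three.
rng = random.Random(1)
toy = [[rng.gauss(0.0, 0.1) for _ in range(6)] for _ in range(50)]
for i in range(5):
    toy[i] = [x + 5.0 for x in toy[i][:3]] + [x - 5.0 for x in toy[i][3:]]

selected = select_top(first_pc_scores(toy), k=5)
print(selected)
```

No class labels enter the calculation, which is what makes the procedure unsupervised. The cutoff k (100 in the caption) is the only tuning choice in this sketch.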
Figure 7. Boxplots of logarithmic P-values obtained from t tests comparing control and treatment samples. P-values < 0.5 indicate greater miRNA upregulation in treatment samples than in controls (justification for using logarithmic P-values is available in Additional file 2). From left to right, the experimental conditions were 2, 5, and 10 days of stress and 1 day of rest; 5 days of stress and 10 days of rest; and 10 days of stress and 42 days of rest. Right-hand boxes (“selected”) are the 27 miRNAs identified by PCA-based unsupervised FE, and left-hand boxes (“others”) are the remaining miRNAs. P-values shown above each plot were calculated using t tests to compare logarithmic P-values between selected and other miRNAs.
Table 2. Summary of studies with association between the 27 selected miRNAs and heart disease
| miRNA | Reference | Association with heart disease |
|---|---|---|
| miR-451 | [ | Upregulated in heart due to ischemia |
| miR-22 | [ | Elevated serum levels in patients with stable chronic systolic heart failure |
| miR-133 | [ | Downregulated in transverse aortic constriction and isoproterenol-induced hypertrophy |
| miR-709 | [ | Upregulated in rat heart four weeks after chronic doxorubicin treatment |
| miR-126 | [ | Association with outcome of ischemic and nonischemic cardiomyopathy in patients with chronic heart failure |
| miR-30 | [ | Inversely related to CTGF in two rodent models of heart disease, and human pathological left ventricular hypertrophy |
| miR-29 | [ | Downregulated in the heart region adjacent to an infarct |
| miR-143 | [ | Molecular key to switching of the vascular smooth muscle cell phenotype that plays a critical role in cardiovascular disease pathogenesis |
| miR-24 | [ | Regulates cardiac fibrosis after myocardial infarction |
| miR-23 | [ | Upregulated during cardiac hypertrophy |
| miR-378 | [ | Cardiac hypertrophy control |
| miR-125 | [ | Important regulator of hESC differentiation to cardiac muscle (potential therapeutic application) |
| miR-675 | [ | Elevated in plasma of heart failure patients |
| let-7 | [ | Aberrant expression of let-7 members in cardiovascular disease |
| miR-16 | [ | Circulating prognostic biomarker in critical limb ischemia |
| miR-26 | [ | Downregulated in a rat cardiac hypertrophy model |
| miR-669 | [ | Prevents skeletal muscle differentiation in postnatal cardiac progenitors |
Table 3. Summary of studies with association between heart disease and KEGG pathways enriched by miRNA target genes
| KEGG pathway | P-value | Reference | Association with heart disease |
|---|---|---|---|
| Axon guidance | 3.61×10^−14 | [ | Axon guidance of sympathetic neurons to cardiomyocytes is a promising target for regulation of cardiac function in diseased hearts |
| Colorectal cancer | 1.92×10^−10 | [ | Significant association between colorectal neoplasm and coronary artery disease |
| Chronic myeloid leukaemia | 1.92×10^−10 | [ | Imatinib mesylate (a therapeutic agent for chronic myeloid leukaemia) has cardiotoxicity |
| Glutamatergic synapse | 3.35×10^−9 | — | — |
| Hepatitis B | 5.86×10^−9 | [ | Hepatitis B virus causes varied forms of heart disease |
| Pancreatic cancer | 6.60×10^−9 | — | — |
| Acute myeloid leukaemia | 2.54×10^−8 | [ | Cause of acute ischemic heart disease |
| Focal adhesion | 6.65×10^−8 | [ | Focal adhesion kinase deletion attenuates pressure overload-induced hypertrophy |
| MAPK signaling pathway | 3.50×10^−7 | [ | Plays an important role in cardiac and vascular disease pathogenesis |
| Endometrial cancer | 4.55×10^−7 | [ | Cardiovascular disease is the leading cause of death among endometrial cancer patients |
| Chagas disease | 6.33×10^−7 | [ | Association with heart disease |
| T cell receptor signaling pathway | 9.82×10^−7 | — | — |
| ErbB signaling pathway | 1.01×10^−6 | [ | ErbB-signaling pathway proteins are potential drug targets for heart failure treatment |
| Prostate cancer | 2.47×10^−6 | [ | Coronary artery disease and prostate cancer are both common diseases sharing many risk factors |
| Neurotrophin signaling pathway | 3.08×10^−6 | — | — |
| Toxoplasmosis | 3.16×10^−6 | [ | Toxoplasmosis is a cause of heart disease |
| Bacterial invasion of epithelial cells | 3.84×10^−6 | — | — |
| TGF-beta signaling pathway | 7.35×10^−6 | [ | Upregulated in infarcted myocardium |
| Nonsmall-cell lung cancer | 9.00×10^−6 | — | — |
| VEGF signaling pathway | 1.17×10^−5 | [ | A VEGF inhibitor has cardiotoxicity |
| Dopaminergic synapse | 2.19×10^−5 | [ | Dopamine agonists affect the cardiovascular system |
Figure 8. Two-dimensional embedding of mRNA expression by PCA. Notations are the same as in Figure 6. See Additional file 2 for more detail.
Figure 9. Comparison of sample contributions to PC1 between mRNA and miRNA. (a) Scatterplot comparing mRNA and miRNA expression profiles for sample contribution to PC1 (i.e., components of the first loading vector). Pearson’s correlation coefficient = −0.37 (P = 0.01). (b) Averaged contributions within each condition. Pearson’s correlation coefficient = −0.69 (P = 0.01). XY-Zd with X = C (control); X = T (treatment); Y, stress days; and Z, rest days in experimental conditions. See Additional file 2 for more detail.
Table 4. P-values calculated by t tests from logarithmic ratios between treated and control samples
| Comparison | | | | | |
|---|---|---|---|---|---|
| Control < treated | 6.9×10^−4 | 7.4×10^−8 | 1.0 | 0.22 | 0.03 |
| Control > treated | 1.0 | 1.0 | 2.35×10^−8 | 0.78 | 0.97 |
Smaller P-values indicate that the difference in mean logarithmic ratio is more significant in the selected 59 mRNAs than that in the other mRNAs. For descriptions of each experimental condition, see the caption for Figure 9. For more methodological detail, see Additional file 2.
Table 5. Summary of studies with association between heart disease and genes selected by CPCAFE
| RefSeq ID | miRNA | Disease association (P-value) |
|---|---|---|
| NM_010174 | miR-709 | Cardiomegaly (0.031) |
| NM_009941 | miR-709 | Cardiac output, low (5×10^−4) |
| NM_001161419 | miR-29a/b-3p, miR-16 | Cardiomyopathies (1.5×10^−3) |
| NM_009943 | miR-23a/b-3p | Heart disease (1.3×10^−3); Cardiomyopathies (2.3×10^−3); Heart failure (5.0×10^−3) |
| NM_001177307 | miR-16 | Cardiomyopathy, dilated (2.0×10^−3) |
| NM_007747 | miR-26a-5p | Cardiomyopathies (5.4×10^−3); Myocardial ischemia (7.4×10^−4) |
| NM_010861 | miR-1983 | Heart defects, congenital (4.9×10^−48); Cardiomyopathy, dilated (5.2×10^−21); Cardiomegaly (1.1×10^−17); Heart failure (2.1×10^−9) |
| NM_010859 | miR-691 | Heart defects, congenital (1.1×10^−7); Cardiomegaly (1.2×10^−4); Myocardial infarction (2.4×10^−4); Heart septal defects, ventricular (6.9×10^−4) |
| NM_011540 | miR-207 | Cardiomyopathy, dilated (1.1×10^−8); Cardiomyopathy, hypertrophic (1.7×10^−3); Heart defects, congenital (6.0×10^−3); Heart failure (1.2×10^−2) |
Table 6. KEGG pathways enriched by 24 mRNAs targeted by 27 miRNAs, identified by DAVID
| KEGG pathway | Number of genes | % | P-value | Adjusted P-value (BH) |
|---|---|---|---|---|
| Cardiac muscle contraction | 7 | 30.4 | 2.30E-09 | 3.20E-08 |
| Parkinson’s disease | 7 | 30.4 | 5.80E-08 | 4.10E-07 |
| Oxidative phosphorylation | 6 | 26.1 | 2.30E-06 | 1.10E-05 |
| Alzheimer’s disease | 6 | 26.1 | 1.20E-05 | 4.20E-05 |
| Huntington’s disease | 6 | 26.1 | 1.20E-05 | 3.50E-05 |
Number of genes is the number of genes included in the pathway; % is the proportion of the 24 genes that are included in the pathway. P-values and P-values adjusted by the BH criterion were provided by DAVID.
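For readers unfamiliar with the BH criterion, the following is a minimal sketch of the Benjamini-Hochberg step-up adjustment. The function name and the four example P-values are illustrative assumptions. The adjusted values in the table cannot be recomputed from the five rows shown, because the correction depends on every pathway that was tested, and those are not listed here.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values in the same order as the input.

    Each P-value is scaled by m / rank (rank 1 = smallest), and the
    result is made monotone by taking running minima from the largest
    P-value downwards.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for four tests.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(raw)])
```

An adjusted P-value is never smaller than the raw one, so enrichment that survives the adjustment remains significant after accounting for the number of pathways tested.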
Figure 10. Schematic that illustrates biological validations towards 27 miRNAs and 59 mRNAs.
Table 7. Top ranked FABP3 inhibitor compounds identified by drug discovery
| Rank | Score | Compound name | DrugBank ID/ChEMBL ID | Known target or activity |
|---|---|---|---|---|
| 1 | 907.4 | Oxaprozin | DB00991/CHEMBL1071 | PTGS1 (Cox1) & PTGS2 (Cox2) inhibitor |
| 2 | 901.9 | — | DB08539/CHEMBL249736 | Inhibits human CDK2 |
| 3 | 866.2 | — | DB06964/CHEMBL252124 | HSP90 |
| 4 | 864.4 | — | DB08702/— | Metallo- |
| 5 | 826.7 | — | DB08396/— | Ig heavy chain V-III region CAM* |
| 6 | 809.3 | — | DB06908/CHEMBL209749 | PPAR |
| 7 | 790.5 | — | DB08483/CHEMBL47590 | Agonist activity at human PPAR |
| 8 | 773.8 | Flavoxate | DB01148/CHEMBL1493 | CHRM1/2 antagonist; PDE 4/7/8 inhibitor |
| 9 | 771.6 | — | DB07594/CHEMBL399530 | Inhibits HSP90 activity |
| 10 | 761.2 | Rolitetracycline | DB01301/CHEMBL1237046 | Inhibitor of 30S ribosomal protein S9 & 16S rRNA; Inhibits synthetic amyloid |
*pharmacological action unknown.
Figure 11. Frequency of 100 probes selected by CPCAFE and also identified by VBPCAFE within the 100 top-ranked probes with largest values over 100 independent ensembles. (a) miRNA; and (b) mRNA. Red and black circles correspond to probes selected or not selected, respectively, by CPCAFE.
Figure 12. Scatterplots between B (horizontal axis) and (vertical axis). Red and black open circles correspond to 100 features selected or not selected, respectively, by CPCAFE. Solid green lines indicate quadratic regression lines. (a) miRNA; and (b) mRNA.
Table 8. Disease association of mRNAs selected by categorical regression-based FE and BAHSIC
| RefSeq ID | Disease association (P-values) |
|---|---|
| Categorical regression-based FE | |
| NM_011652 | Cardiomyopathy, dilated (5.49×10^−7, 1.80×10^−10); Cardiomyopathies (1.5×10^−4, 9.19×10^−5); Cardiomyopathy, hypertrophic (5.6×10^−3, 4.74×10^−10); Heart defects, congenital (1.90×10^−2, —); Heart failure (—, 1.02×10^−5); Cardiomyopathy, hypertrophic, familial (—, 1.02×10^−5) |
| NM_019494 | Cardiomyopathy, hypertrophic (—, 3.05×10^−2) |
| NM_001177374 | Heart defects, congenital (8.64×10^−4, —); Cardiomegaly (1.46×10^−3, —) |
| NM_008093 | Heart defects, congenital (1.40×10^−7, —); Heart septal defects (4.12×10^−3, —); Hypertrophy, right ventricular (1.36×10^−4, —); Cardiovascular abnormalities (1.49×10^−3, —); Cardiomyopathy, dilated (8.14×10^−3, —); Cardiomyopathies (9.27×10^−3, —); Cardiomegaly (1.75×10^−2, 7.60×10^−3) |
| NM_011828 | Cardiomyopathies (3.87×10^−3, —) |
| BAHSIC | |
| NM_009722 | Cardiomegaly (7×10^−21, 1.33×10^−3); Cardiomyopathy, dilated (2.68×10^−12, 7.37×10^−9); Cardiomyopathies (5.87×10^−12, 9.69×10^−4); Heart failure (6.50×10^−12, —); Cardiac output, low (2.01×10^−7, —); Arrhythmias, cardiac (8.85×10^−7, —); Hypertrophy, left ventricular (3.03×10^−6, 3.38×10^−2); Myocardial reperfusion injury (2.33×10^−5, —); Myocardial stunning (3.42×10^−3, —); Atrial fibrillation (5.42×10^−3, —); Myocardial infarction (6.11×10^−3, —); Heart disease (2.38×10^−2, 7.00×10^−4); Ventricular dysfunction, left (2.74×10^−2, 3.4×10^−4); Cardiomyopathy, restrictive (—, 2.12×10^−3); Myocardial ischemia (—, 2.91×10^−3); Mitral valve insufficiency (—, 3.68×10^−3); Cardiomyopathy, hypertrophic (—, 3.49×10^−2) |
| NM_001164171 | (Already identified by CPCAFE) |
| NM_008725 | Heart defects, congenital (2.53×10^−55, 9.09×10^−9); Cardiomegaly (1.11×10^−28, —); Cardiomyopathies (7.57×10^−17, —); Hypertrophy, left ventricular (9.16×10^−12, 1.19×10^−13); Cardiomyopathy, dilated (6.47×10^−11, 2.41×10^−8); Heart disease (9.72×10^−8, 1.11×10^−4); Endocardial cushion defects (3.51×10^−6, —); Heart septal defects (4.63×10^−6, —); Cardiovascular abnormalities (6.34×10^−5, —); Tachycardia, ectopic atrial (1.97×10^−4, —); Arrhythmias, cardiac (4.35×10^−4, —); Cardiomyopathy, hypertrophic, familial (4.90×10^−3, —); Heart septal defects, ventricular (5.70×10^−3, —); Hypertrophy, right ventricular (1.04×10^−2, —); Heart failure (1.04×10^−2, 5.07×10^−29); Cardiomyopathy, hypertrophic (2.20×10^−2, 4.40×10^−2); Ventricular dysfunction, left (—, 3.04×10^−10); Myocardial reperfusion injury (—, 1.76×10^−7); Cardiovascular disease (—, 4.21×10^−6); Myocardial infarction (—, 9.73×10^−6); Mitral valve insufficiency (—, 1.04×10^−5); Heart valve disease (—, 3.27×10^−5); Myocardial ischemia (—, 1.49×10^−4); Ventricular dysfunction (—, 0.198×10^−3); Cardiomyopathy, restrictive (—, 2.68×10^−2); Ventricular dysfunction, right (—, 3.53×10^−3) |
| NM_008103 | Heart defects, congenital (6×10^−3, —) |
| NM_009608 | Heart defects, congenital (2.54×10^−38, —); Cardiomyopathy, dilated (6.55×10^−9, 2.16×10^−14); Heart septal defects (5.15×10^−7, 6.80×10^−4); Arrhythmias, cardiac (4.90×10^−5, —); Cardiomyopathies (2.75×10^−4, 1.23×10^−2); Heart septal defects, atrial (1.33×10^−3); Cardiovascular abnormalities (3.85×10^−2, —); Cardiomegaly (4.45×10^−2, 1.01×10^−4); Cardiomyopathy, hypertrophic (—, 4.91×10^−13); Cardiomyopathy, hypertrophic, familial (—, 1.97×10^−6); Cardiomyopathy, restrictive (—, 5.87×10^−4); Heart septal defects, atrial (—, 1.73×10^−3); Hypertrophy, left ventricular (—, 1.09×10^−2) |
| NM_013468 | Heart defects, congenital (3.98×10^−12, 4.70×10^−5); Cardiomegaly (2.52×10^−6, 1.00×10^−2); Cardiomyopathies (9.11×10^−5, —); Heart septal defects (6.18×10^−4, —); Cardiomyopathy, hypertrophic, familial (9.66×10^−4, —); Cardiomyopathy, dilated (1.22×10^−2, 1.22×10^−2); Heart failure (—, 3.04×10^−4); Hypertrophy, left ventricular (—, 7.55×10^−3) |
| NM_011540 | (Already identified by CPCAFE) |
| NM_177369 | (Already identified by CPCAFE) |
| NM_007450 | (Already identified by CPCAFE) |
| NM_010174 | (Already identified by CPCAFE) |
| NM_008084 | (Already identified by CPCAFE) |
| NM_019494 | (Already identified by categorical regression-based FE) |
| NM_013463 | Cardiomyopathy, hypertrophic (—, 2.46×10^−2); Heart disease (—, 2.64×10^−2); Cardiomyopathies (—, 3.10×10^−2) |
| NM_001038592 | Myocardial reperfusion injury (—, 1.56×10^−3) |
P-values obtained from the Gendoo server for both mouse and human. Genes annotated as “Already identified by CPCAFE” are listed in Table 5.
Figure 13. Venn diagram of mRNAs extracted by CPCAFE and BAHSIC, and heart failure-related disease association. Numbers in parentheses correspond to those with outliers along PC2 (see Additional file 2).
Table 9. Summary of studies with association between heart disease and KEGG pathways enriched by BAHSIC identified miRNA target genes
| KEGG pathway | P-value | Reference | Association with heart disease |
|---|---|---|---|
| Long-term depression | 5.37×10^−19 | [ | Depression has been linked with additional health problems, including heart disease |
| Prion diseases | 3.13×10^−17 | [ | Prion-induced amyloid heart disease with high blood infectivity in transgenic mice |
| Axon guidance | 5.99×10^−14 | (Already identified by CPCAFE) | |
| Fatty acid biosynthesis | 2.01×10^−12 | [ | A relationship between fatty acid synthase and cardiac calcium signaling has been suggested |
| Calcium signaling pathway | 2.04×10^−10 | (Already identified by CPCAFE) | |
| Gap junction | 5.03×10^−10 | [ | Gap junction alterations in human cardiac disease |
| Prostate cancer | 6.04×10^−10 | (Already identified by CPCAFE) | |
| Pancreatic cancer | 1.63×10^−9 | — | — |
| Glutamatergic synapse | 2.02×10^−8 | — | — |
| Endometrial cancer | 2.02×10^−8 | (Already identified by CPCAFE) | |
| Long-term potentiation | 2.65×10^−8 | — | — |
| Neurotrophin signaling pathway | 2.81×10^−8 | — | — |
| Focal adhesion | 8.34×10^−8 | (Already identified by CPCAFE) | |
| TGF-beta signaling pathway | 2.00×10^−7 | (Already identified by CPCAFE) | |
| Endocytosis | 2.27×10^−7 | — | — |
| Cholinergic synapse | 9.60×10^−7 | — | — |
| Regulation of actin cytoskeleton | 1.33×10^−6 | — | — |
| Colorectal cancer | 3.52×10^−6 | (Already identified by CPCAFE) | |
| Salmonella infection | 5.65×10^−6 | — | — |
| PI3K-Akt signaling pathway | 1.01×10^−5 | — | — |
Pathways annotated as “Already identified by CPCAFE” are listed in Table 3.
Figure 14. Barplots comparing CPCAFE, categorical regression based FE, and BAHSIC. See text for details about panels included, from (a) to (l).
Figure 15. Venn diagram of miRNAs extracted by CPCAFE and BAHSIC.
Figure 16. Overall study work flow. The methods used for data processing and evaluation are indicated in red and blue, respectively. Dotted lines in sample descriptions (i.e., 1 or 2 day(s) stress vs 1 day rest) indicate missing control samples (see Methods for more detail).