Wei Kong, Xiaoyang Mou, Qingzhong Liu, Zhongxue Chen, Charles R Vanderburg, Jack T Rogers, Xudong Huang.
Abstract
BACKGROUND: Gene microarray technology is an effective tool for investigating the simultaneous activity of multiple cellular pathways across hundreds to thousands of genes. However, the colossal amounts of data generated by DNA microarray technology are usually complex, noisy, high-dimensional, and often hindered by low statistical power, which makes them difficult to exploit. To overcome these problems, two kinds of unsupervised analysis methods have been developed for microarray data: principal component analysis (PCA) and independent component analysis (ICA). PCA projects the data into a new space spanned by the principal components, which are mutually orthonormal. The constraints of mutual orthogonality and second-order statistics within PCA algorithms, however, may not be appropriate for the biological systems studied. Extracting and characterizing the most informative features of the biological signals requires higher-order statistics.
Year: 2009 PMID: 19173745 PMCID: PMC2646728 DOI: 10.1186/1750-1326-4-5
Source DB: PubMed Journal: Mol Neurodegener ISSN: 1750-1326 Impact factor: 14.195
Figure 1. ICA decomposition results of AD microarray gene expression data. FastICA was applied to the AD microarray data matrix X with 13 samples and 3617 genes. Under the ICA model X = AS, the FastICA algorithm decomposes matrix X (13 × 3617) into the latent variable matrix A (13 × 13) and the gene signature matrix S (13 × 3617).
Figure 2. Hierarchical clustering of the ICA and PCA outputs. (A) Hierarchical clustering of the ICA outputs with the last two 'common' components of matrix A removed. To display the cluster dendrogram conveniently, we transposed matrix A in the graph. That is, the columns of the graphical representation correspond to the rows of matrix A, denoting samples, and the rows of the graphical representation correspond to the columns of matrix A, denoting the ICA latent variables. (B) Hierarchical clustering of the principal components, with the number of principal components k = 10. Similarly, the rows and columns of the graph denote the principal components and samples, respectively.
Figure 3. Unsupervised hierarchical clustering of the normalized raw data (A), the data reconstructed by PCA (B), and the data reconstructed by FastICA (C). C1–8: control samples, AD1–5: severe AD samples. Red and green blocks represent signal increase and decrease from the mean, respectively. For the PCA-reconstructed data, the first 10 principal components were used, and the cumulative contribution of the corresponding eigenvalues was 95.5%. For the ICA-derived data, genes with loadings that exceeded the threshold (= 2) were considered significant, and the remaining genes with lower values were considered noise and set to zero. Compared with the original data, both the PCA- and ICA-derived data greatly improved the clustering results of the AD microarray data.
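The thresholding step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the matrices are tiny invented values, the helper names `threshold_signatures` and `matmul` are hypothetical, and treating a loading of exactly 2 as noise is an assumption based on the word "exceed".

```python
def threshold_signatures(S, threshold=2.0):
    """Keep loadings whose absolute value exceeds the threshold; zero the rest."""
    return [[v if abs(v) > threshold else 0.0 for v in row] for row in S]


def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# Toy values invented for illustration: 2 samples, 2 signatures, 4 genes.
A = [[1.0, 0.5],
     [0.2, 1.0]]
S = [[3.1, -0.4, 0.9, -2.6],
     [0.3, 2.2, -1.5, 0.1]]

# Suppress low loadings, then rebuild the expression matrix as X = A S.
S_sparse = threshold_signatures(S, threshold=2.0)
X_denoised = matmul(A, S_sparse)

print(S_sparse)
print(X_denoised)
```

In this toy case only three loadings survive, so the reconstructed matrix is driven by a few genes per signature, which is the effect the caption attributes to the ICA-derived data.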
Figure 4. The first 10 PCs and ICs extracted by the PCA and ICA methods, respectively. (A) The first 10 PCs obtained by PCA, which capture 95.5% of the information in the original AD microarray data. (B) The first 10 ICs (gene signatures) uncovered by ICA from the AD microarray data, in which only a few relevant genes were significantly affected, leaving the majority of genes unaffected. The x-axis in each graph denotes genes, and the y-axis represents relative signal intensity.
Figure 5. The corresponding histograms of the first 10 PCs (A) and ICs (B) in Figure 4. The histograms of the ICs in (B) were more super-Gaussian than those of the PCs in (A). ICA extracted sparser gene signatures and, since each gene signature affects only a relatively small percentage of all genes, we can expect that ICA found more significant genes related to AD.
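One common way to quantify how super-Gaussian a distribution is, is its excess kurtosis, which is positive for peaked, heavy-tailed (sparse) distributions and zero for a Gaussian. The sketch below assumes this measure for illustration; the caption does not say which statistic, if any, the authors computed, and the two sample vectors are invented.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / variance^2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


# Invented examples: a sparse signature (most genes near zero, a few large
# loadings) versus a profile whose values are spread evenly around the mean.
sparse_signature = [0.0] * 18 + [4.0, -4.0]
flat_profile = [-1.0, 1.0] * 10

print(excess_kurtosis(sparse_signature))  # positive: super-Gaussian
print(excess_kurtosis(flat_profile))      # negative: sub-Gaussian
```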
Figure 6. Hinton diagram representation of latent variable matrix A. The size of each square corresponds to the amount a_nm of component m in sample n. White and black represent positive and negative values, respectively.
Figure 7. The 4th (A) and 5th (B) gene signatures (corresponding to the 4th and 5th columns of matrix A). Genes with loadings that exceeded the chosen threshold (red line) were considered significant. Here, the threshold = 2. The positive and negative loadings correspond to up- and down-regulation of expression, respectively.
Figure 8. The selected significant genes for the 4th (A) and 5th (B) gene signatures. Here, the threshold = 2. For reconstructing the gene expression profile, all items in matrix S whose absolute values were less than this threshold were set to zero.
Selected genes up-regulated in severe AD
| Gene symbol | Gene name | Chromosomal location |
| --- | --- | --- |
| AMIGO2 | adhesion molecule with Ig-like domain 2 | chr12q13.11 |
| BTG1 | B-cell translocation gene 1, anti-proliferative | chr12q22 |
| CD24 | CD24 molecule | chr6q21 |
| CD44 | CD44 molecule (Indian blood group) | chr11p13 |
| CDC42EP4 | CDC42 effector protein (Rho GTPase binding) 4 | chr17q24-q25 |
| IFITM1 | interferon-induced transmembrane protein 1 (9–27) | chr11p15.5 |
| IFITM2 | interferon-induced transmembrane protein 2 (1–8D) | chr11p15.5 |
| IRF7 | interferon regulatory factor 7 | chr11p15.5 |
| IFI44L | interferon-induced protein 44-like | chr1p31.1 |
| IL4R | interleukin 4 receptor | chr16p12.1-p11.2 |
| IRAK1 | interleukin-1 receptor-associated kinase 1 | chrXq28 |
| NFKBIA | nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, alpha | chr14q13 |
| CAMK2B | calcium/calmodulin-dependent protein kinase (CaM kinase) II beta | chr22q12\|7p14.3-p14.1 |
| CALM1 | calmodulin 1 (phosphorylase kinase, delta) | chr14q24-q31 |
| CAPZA2 | capping protein (actin filament) muscle Z-line, alpha 2 | chr7q31.2-q31.3 |
| CHGB | chromogranin B (secretogranin 1) | chr20pter-p12 |
| LOC728320/LTF | lactotransferrin/similar to lactotransferrin | chr3q21-q23 |
| MPPE1 | metallophosphoesterase 1 | chr18p11.21 |
| MT1F | metallothionein 1F | chr16q13 |
| MT1M | metallothionein 1M | chr16q13 |
| MBP | myelin basic protein | chr18q23 |
| SCGN | secretagogin, EF-hand calcium binding protein | chr6p22.3-p22.1 |
| SLC24A3 | solute carrier family 24(sodium/potassium/calcium exchanger), member 3 | chr20p13 |
| SLC7A11 | solute carrier family 7, (cationic amino acid transporter, y+ system) member 11 | chr4q28-q32 |
| ZIC1 | Zic family member 1 (odd-paired homolog, Drosophila) | chr3q24 |
| ZBTB20 | zinc finger and BTB domain containing 20 | chr3q13.2 |
| ZNF500 | zinc finger protein 500 | chr16p13.3 |
| ZNF580 | zinc finger protein 580 | chr19q13.42 |
| ZNF652 | zinc finger protein 652 | chr17q21.32 |
| ZNF710 | zinc finger protein 710 | chr15q26.1 |
| NMB | neuromedin B | chr15q22-qter |
| LOC644166/LOC644191/LOC728937/RPS26 | ribosomal protein S26/similar to 40S ribosomal protein S26 | chr12q13/chr17q21.31/chr2q31.1/chr4q26 |
| SORBS3 | sorbin and SH3 domain containing 3 | chr8p21.3 |
| COL21A1 | collagen, type XXI, alpha 1 | chr6p12.3-p11.2\|6p12.3-p11.2 |
| CTBP1 | C-terminal binding protein 1 | chr4p16 |
| CAPZA2 | capping protein (actin filament) muscle Z-line, alpha 2 | chr7q31.2-q31.3 |
| FLNA | filamin A, alpha (actin binding protein 280) | chrXq28 |
| APOC2/APOC4 | apolipoprotein C-II/apolipoprotein C-IV | chr19q13.2 |
| APOE | apolipoprotein E | chr19q13.2 |
| ABCA1 | ATP-binding cassette, sub-family A (ABC1), member 1 | chr9q31.1 |
| GAD2 | glutamate decarboxylase 2 (pancreatic islets and brain, 65 kDa) | chr10p11.23 |
| LDLRAP1 | low density lipoprotein receptor adaptor protein 1 | chr1p36-p35 |
| AEBP1 | AE binding protein 1 | chr7p13 |
| TAP1 | transporter 1, ATP-binding cassette, sub-family B (MDR/TAP) | chr6p21.3 |
| UBAP2L | ubiquitin associated protein 2-like | chr1q21.3 |
| HLA-DRB4 | major histocompatibility complex, class II, DR beta 4 | chr6p21.3 |
| TRHDE | thyrotropin-releasing hormone degrading enzyme | chr12q15-q21 |
| TMEM92 | transmembrane protein 92 | chr17q21.33 |
| SERPINA3 | serpin peptidase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 3 | chr14q32.1 |
| CDKN1C | cyclin-dependent kinase inhibitor 1C (p57, Kip2) | chr11p15.5 |
| GSTM5 | glutathione S-transferase M5 | chr1p13.3 |
| SPARC | secreted protein, acidic, cysteine-rich (osteonectin) | chr5q31.3-q32 |
Selected genes down-regulated in severe AD
| Gene symbol | Gene name | Chromosomal location |
| --- | --- | --- |
| CD22/MAG | CD22 molecule/myelin associated glycoprotein | chr19q13.1 |
| CABP1 | calcium-binding protein 1 | chr12q24.31 |
| CACNG3 | calcium channel, voltage-dependent, gamma subunit 3 | chr16p12-p13.1 |
| CAMK2B | calcium/calmodulin-dependent protein kinase (CaM kinase) II beta | chr22q12\|7p14.3-p14.1 |
| CAMK1G | calcium/calmodulin-dependent protein kinase IG | chr1q32-q41 |
| CAPZB | capping protein (actin filament) muscle Z-line, beta | chr1p36.1 |
| MET | met proto-oncogene (hepatocyte growth factor receptor) | chr7q31 |
| ZNF365 | zinc finger protein 365 | chr10q21.2 |
| TFRC | transferrin receptor (p90, CD71) | chr3q29 |
| APLP2 | amyloid beta (A4) precursor-like protein 2 | chr11q23-q25\|11q24 |
| CYP26B1 | cytochrome P450, family 26, subfamily B, polypeptide 1 | chr2p13.3 |
| NEFH | neurofilament, heavy polypeptide 200 kDa | chr22q12.2 |
| NEFL | neurofilament, light polypeptide 68 kDa | chr8p21 |
| NPY | neuropeptide Y | chr7p15.1 |
| NTRK2 | neurotrophic tyrosine kinase, receptor, type 2 | chr9q22.1 |
| SERPINI1 | serpin peptidase inhibitor, clade I (neuroserpin), member 1 | chr3q26.1 |
| OLIG2 | oligodendrocyte lineage transcription factor 2 | chr21q22.11 |
| NRSN2 | neurensin 2 | chr20p13 |
| CSPG5 | chondroitin sulfate proteoglycan 5 (neuroglycan C) | chr3p21.3 |
| C1orf115 | chromosome 1 open reading frame 115 | chr1q41 |
| C20orf149 | chromosome 20 open reading frame 149 | chr20q13.33 |
| C9orf16 | chromosome 9 open reading frame 16 | chr9q34.1 |
| HNRPA3/HNRPA3P1 | heterogeneous nuclear ribonucleoprotein A3 pseudogene 1/heterogeneous nuclear ribonucleoprotein A3 | chr10q11.21/chr2q31.2 |
| ACTB | actin, beta | chr7p15-p12 |
| SMARCA4 | SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 4 | chr19p13.2 |
| ABCA2 | ATP-binding cassette, sub-family A (ABC1), member 2 | chr9q34 |
| ATP6V0C | ATPase, H+ transporting, lysosomal 16 kDa, V0 subunit c | chr16p13.3 |
| ATP13A2 | ATPase type 13A2 | chr1p36 |
| BCAS1 | breast carcinoma amplified sequence 1 | chr20q13.2-q13.3 |
| CABP1 | calcium-binding protein 1 | chr12q24.31 |
| RIMS3 | regulating synaptic membrane exocytosis 3 | chr1pter-p22.2 |
| PCSK1 | proprotein convertase subtilisin/kexin type 1 | chr5q15-q21 |
| RIMS2 | regulating synaptic membrane exocytosis 2 | chr8q22.3 |
| GRIN1 | glutamate receptor, ionotropic, N-methyl D-aspartate 1 | chr9q34.3 |
| MBP | myelin basic protein | chr18q23 |
| MOBP | myelin-associated oligodendrocyte basic protein | chr3p22.1 |
| PIP3-E | phosphoinositide-binding protein PIP3-E | chr6q25.2 |
| PLD3 | phospholipase D family, member 3 | chr19q13.2 |
| PTPRT | protein tyrosine phosphatase, receptor type, T | chr20q12-q13 |
| EIF5A | eukaryotic translation initiation factor 5A | chr17p13-p12 |
| ISG15 | ISG15 ubiquitin-like modifier | chr1p36.33 |
| RCAN2 | regulator of calcineurin 2 | chr6p12.3 |
| RGS4 | regulator of G-protein signaling 4 | chr1q23.3 |
| SRD5A1 | steroid-5-alpha-reductase, alpha polypeptide 1 (3-oxo-5 alpha-steroid delta 4-dehydrogenase alpha 1) | chr5p15 |
Figure 9. The number of significant genes on each chromosome selected by the ICA, PCA and SVM-RFE methods, respectively. The ICA method extracted more significant genes on chromosomes 1, 3, 4, 7, 8, 9, 11, 12, 14, 17, 18, 19, 20, 21 and 22. In particular, the numbers of genes on chromosomes 1, 3, 7 and 20 extracted by ICA were significantly higher than those obtained by PCA and SVM-RFE.
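A per-chromosome tally like the one plotted in this figure can be derived from chromosomal location strings of the kind listed in the gene tables. The sketch below is an assumption about how such a count might be made, not the authors' procedure: the helper names are hypothetical, only `chr`-prefixed locations are recognized, and a gene annotated with several loci is counted once on each distinct chromosome.

```python
import re
from collections import Counter

# Matches the chromosome name that follows a "chr" prefix, e.g. "12", "X".
CHROM_RE = re.compile(r"chr(\d{1,2}|[XY])")


def chromosomes(location):
    """Return the distinct chromosome names mentioned in a location string."""
    return sorted(set(CHROM_RE.findall(location)))


def count_by_chromosome(locations):
    """Tally how many genes map to each chromosome."""
    counts = Counter()
    for loc in locations:
        counts.update(chromosomes(loc))
    return counts


# A few location strings taken from the up-regulated gene table.
locations = [
    "chr12q13.11",                             # AMIGO2
    "chr12q22",                                # BTG1
    "chr11p15.5",                              # IFITM1
    "chrXq28",                                 # IRAK1
    "chr19q13.2",                              # APOE
    "chr12q13/chr17q21.31/chr2q31.1/chr4q26",  # RPS26 and similar loci
]

counts = count_by_chromosome(locations)
for chrom, n in sorted(counts.items()):
    print(chrom, n)
```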
Figure 10. ICA vector model of microarray gene expression data. Matrix X denotes the microarray gene expression data with m genes under n samples or conditions. The columns of A = [a1, a2, ..., an] are the latent vectors of the gene microarray data (A is n × n), and S denotes the n × m gene signature matrix or expression mode, in which the rows of S are statistically independent of each other. Each gene profile x(t) obtained by microarray technology is considered to be a linear combination of the statistically independent components S, which have specific biological interpretations, and the latent variables A.
Figure 11. Theoretical framework of ICA algorithms on microarray gene expression data. In the ICA model for microarray gene expression data, the matrix X is the only one we know. According to the hypothesis (left frame) that gene profiles are linear combinations of statistically independent components S and the latent variables A, the demixing process of ICA (right frame) can be applied to extract the latent variables A (once we get the demixing matrix W, we can obtain the latent variables easily by A = W^-1) and the gene signatures Y that have specific biological interpretations (Y is the estimator of S).
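The algebra behind the demixing step can be checked on a toy example. The sketch below writes W for the demixing matrix and Y for the estimated signatures; these symbol names are assumed here, following common ICA notation, and all numbers are invented. It also takes a shortcut: a real ICA algorithm such as FastICA must estimate W from X alone, and can recover the sources only up to permutation and scaling, whereas here W is obtained by inverting a known mixing matrix A.

```python
def matmul(A, B):
    """Plain-Python matrix product of A (n x k) and B (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det],
            [-c / det, a / det]]


# Invented toy model: 2 samples, 2 independent signatures, 3 genes.
A = [[2.0, 1.0],
     [1.0, 1.0]]          # latent variable (mixing) matrix
S = [[1.0, 0.0, -2.0],
     [0.0, 3.0, 1.0]]     # gene signatures

X = matmul(A, S)          # observed expression data, X = A S
W = inverse_2x2(A)        # demixing matrix, so that A = W^-1
Y = matmul(W, X)          # estimated signatures, Y = W X

print(X)
print(Y)
```

Because W is the exact inverse of A in this sketch, Y reproduces S; with an estimated W the rows of Y would match the rows of S only up to order and scale.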