Swee Heng Toh, Philip Prathipati, Efthimios Motakis, Chee Keong Kwoh, Surya Pavan Yenamandra, Vladimir A Kuznetsov.
Abstract
BACKGROUND: Lung cancer is the leading cause of cancer deaths in the world. The most common type of lung cancer is lung adenocarcinoma (AC). The genetic mechanisms underlying the early stages and the progression of lung AC are poorly understood. There is currently no clinically applicable gene test for early diagnosis or for assessing AC aggressiveness. Among the major reasons for the lack of reliable diagnostic biomarkers are the extraordinary heterogeneity of the cancer cells, the complex and poorly studied interactions of the AC cells with adjacent tissue and the immune system, gene variation across patient cohorts, measurement variability, small sample sizes and sub-optimal analytical methods. We suggest that gene expression profiling of the primary tumours and adjacent tissues (PT-AT), handled with a rational statistical and bioinformatics strategy of biomarker prediction and validation, could provide significant progress in the identification of clinical biomarkers of AC. To minimise sample-to-sample variability, repeated multivariate measurements in the same object (organ or tissue, e.g. PT-AT in lung) across patients should be designed, but prediction and validation on the genome scale with a small sample size is a great methodological challenge.
Year: 2011 PMID: 22369099 PMCID: PMC3377915 DOI: 10.1186/1471-2164-12-S3-S24
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Workflow of the discriminative feature selection method.
Figure 2. Classification accuracy before and after cross-normalization of the original MAS5-normalized data. Panel A depicts the improvement in the classification accuracy of selected, well-established lung AC gene markers. Panel B depicts the improvement in the classification accuracy of a 5-gene signature of lung AC [3].
Figure 3. The influence of cross-normalization on the separability of paired samples, illustrated for selected well-established lung AC gene markers (panel A) and a 5-gene lung AC signature [3] (panel B). The patients are rank-ordered by the cross-normalized expression intensities of the tumor samples.
Figure 4. Venn diagram analysis of the reproducibility of classifiers derived using two different Affymetrix platforms. The high degree of overlap is notable given the different microarray experimental designs (the U133A classifier was derived using a paired design, while the U95A classifier was derived using an unpaired design). The p-value, which reflects the significance of the overlap between the two sets, was estimated using Fisher’s exact test [54].
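The overlap test mentioned in the Figure 4 caption can be sketched as a one-sided Fisher's exact test, which for a 2x2 overlap table is the upper tail of a hypergeometric distribution. This is only an illustration: the function name `overlap_p_value`, the choice of a one-sided test, and the toy numbers in the usage line are assumptions, not values or settings taken from the paper.

```python
from math import comb


def overlap_p_value(universe: int, size_a: int, size_b: int, overlap: int) -> float:
    """One-sided Fisher's exact p-value for the overlap of two sets.

    Returns the probability of seeing an overlap at least as large as
    `overlap` when `size_b` items are drawn at random from `universe`
    items, `size_a` of which belong to the first set.
    """
    max_overlap = min(size_a, size_b)
    total = comb(universe, size_b)
    # Sum the hypergeometric probabilities for k = overlap .. max_overlap.
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical toy numbers: 20 probe sets in total, two classifiers of 5 each, 3 shared.
p = overlap_p_value(universe=20, size_a=5, size_b=5, overlap=3)
print(f"p = {p:.4f}")
```

In practice the universe should be the set of probe sets that both platforms can measure, since a larger universe makes any given overlap look more significant.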
Figure 5. Venn diagrams illustrating the degree of overlap between the statistically significant probe sets identified by the Modified Wilcoxon test (MWT) and the corresponding probe sets identified by other methods (WT, t-test, PAM and EDGE). Panel A: Venn diagram illustrating the overlap between the MWT classifier and the top 2,829 differentially expressed probe sets, rank-ordered using FDR values and identified using canonical approaches (t-test and WT with FDR correction). Panel B: Venn diagram depicting the overlap between the MWT classifier and the top-ranked probe sets identified using computationally intensive feature selection methods.
Figure 6. Venn diagrams illustrating the overlap of the classifiers derived using EDGE, PAM, WT and ECD. Panel A: Venn diagram depicting the overlap of the U133A probe sets identified as highly discriminative features. Panel B: Venn diagram depicting the overlap between the RefSeq gene symbols corresponding to the U133A probe sets described above. The numbers in parentheses are the numbers of known gene symbols present in a given subset.
Distribution of the number of false classifications of the 27 paired samples.
| Direction | Misclassified pairs | ECD (MWT) | | | | |
|---|---|---|---|---|---|---|
| Up-regulated in tumours | 0 | 1628 | 1109 | 938 | 1087 | 1054 |
| | 1 | 0 | 202 | 211 | 184 | 187 |
| | 2 | 0 | 27 | 73 | 20 | 14 |
| | 3 | 0 | 1 | 25 | 0 | 0 |
| | 4 | 0 | 0 | 6 | 0 | 0 |
| | 9 | 0 | 0 | 1 | 0 | 0 |
| Down-regulated in tumours | 8 | 0 | 0 | 1 | 0 | 0 |
| | 7 | 0 | 0 | 1 | 0 | 0 |
| | 6 | 0 | 0 | 2 | 0 | 0 |
| | 5 | 0 | 0 | 12 | 0 | 0 |
| | 4 | 0 | 3 | 29 | 2 | 2 |
| | 3 | 0 | 16 | 67 | 15 | 10 |
| | 2 | 0 | 64 | 118 | 68 | 64 |
| | 1 | 0 | 317 | 314 | 343 | 372 |
| | 0 | 1201 | 1090 | 1031 | 1110 | 1126 |
| # of probe sets with 2 or more misclassified pairs | | 0 | 111 | 335 | 105 | 90 |
| Total # of probe sets | | 2829 | 2829 | 2829 | 2829 | 2829 |
| Fraction of probe sets with 2 or more misclassified pairs | | 0 | 0.04 | 0.12 | 0.04 | 0.03 |
Comparison of the classification accuracy of the 2829 highly discriminative probe sets (selected using MWT on cross-normalized signal intensity values with a stringent 100% accuracy criterion and a bootstrap p-value cut-off < 0.05) with the top 2829 probe sets identified by the standard Wilcoxon signed-rank test (WT), EDGE, PAM, and Student’s t-test. While the ECD classifier was derived using the cross-normalized dataset as input, the classifiers derived using PAM, EDGE, WT and t-test used the original MAS5-normalized data as input. However, the classification accuracy, in terms of the number of probe sets with 2 or more anomalous fold changes, was estimated using the cross-normalized dataset.
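The three summary rows of the table follow directly from the distribution rows, which can be checked with a short sketch. The counts below are copied from the table; the columns are kept in the order printed and are addressed by position only, because the table as extracted does not name them (the first column, with no misclassified pairs at all, corresponds to the 100%-accuracy ECD selection). The names `ROWS` and `summarise` are illustrative.

```python
# (direction, number of misclassified pairs, counts for the five columns as printed)
ROWS = [
    ("up", 0, [1628, 1109, 938, 1087, 1054]),
    ("up", 1, [0, 202, 211, 184, 187]),
    ("up", 2, [0, 27, 73, 20, 14]),
    ("up", 3, [0, 1, 25, 0, 0]),
    ("up", 4, [0, 0, 6, 0, 0]),
    ("up", 9, [0, 0, 1, 0, 0]),
    ("down", 8, [0, 0, 1, 0, 0]),
    ("down", 7, [0, 0, 1, 0, 0]),
    ("down", 6, [0, 0, 2, 0, 0]),
    ("down", 5, [0, 0, 12, 0, 0]),
    ("down", 4, [0, 3, 29, 2, 2]),
    ("down", 3, [0, 16, 67, 15, 10]),
    ("down", 2, [0, 64, 118, 68, 64]),
    ("down", 1, [0, 317, 314, 343, 372]),
    ("down", 0, [1201, 1090, 1031, 1110, 1126]),
]


def summarise(rows, min_errors=2):
    """Return per-column totals, counts with >= min_errors, and their fractions."""
    n_cols = len(rows[0][2])
    totals = [0] * n_cols
    flagged = [0] * n_cols
    for _direction, errors, counts in rows:
        for col, count in enumerate(counts):
            totals[col] += count
            if errors >= min_errors:
                flagged[col] += count
    fractions = [round(f / t, 2) for f, t in zip(flagged, totals)]
    return totals, flagged, fractions


totals, flagged, fractions = summarise(ROWS)
print(totals)
print(flagged)
print(fractions)
```

The last row of the table therefore holds fractions of the 2829 probe sets rather than percentages; for example, 335 of 2829 is about 0.12, i.e. roughly 12%.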
Figure 7. Heatmap depicting four gene clusters which perfectly separate the AC and the surrounding lung tissue samples. The similarity of the profiles was estimated using two-way hierarchical cluster analysis of the cross-normalized expression signal intensity values. The panel on the left of the heatmap shows the distribution of fold-changes of individual probe sets, ranked based on the Euclidean distance metric. The distribution of the fold-change values specifies the AC up- and down-regulated gene clusters.
Selected GO terms and selected genes illustrating the functional categories enriched in the clusters of discriminative genes derived using ECD (Figure 7). Several representative gene symbols are indicated in parentheses.
| Cluster 1 |
|---|
| • Inflammatory response [AGER, ALOX5, MYD88, SEPP1, SERPING1] |
| • Regulation of cytoskeleton organization [ABLIM1, DES, DST, NEDD9, PALLD, PRF1] |
| • Positive regulation of response to stimulus [C1QA, C1QB, C7, CADM1, CX3CL1, FABP4, FCER1G, MYD88, SERPING1, SLIT2] |
| • Wound healing [CD36, GNAQ, HBEGF, MYH10, SERPING1, THBD, VWF] |
| • Response to mechanical stimulus [BTG2, TIM2, CAV1, CCL2, MGP, TXNIP, TGFBR2, FOS] |
| • Negative regulation of cell proliferation [SFTPD, VSIG4] |
| • Regulation of locomotion [EGFL7, CLIC4, AGER, DLC1, ENPP2, EDN1, IL6, IL6ST, VCL] |
| • Complement activation [C1QA, C1QB, C7, SERPING1] |
| • Positive regulation of immune response [C1QA, C1QB, C7, CADM1, FCER1G, MYD88, SERPING1] |
| • Hormonal immune response [C1QA, C1QB, C7, CCL2, IL6, SERPING1] |
| Cluster 2 |
| • M phase of mitotic cell cycle [AURKA, BIRC5, BUB1, CCNB1, CENPE] |
| • Negative regulation of intracellular transport [BARD1, GSK3B, NCBP2, NF1, TACC3] |
| • Microtubule-based process [CEP250, DST] |
| • Response to DNA damage stimulus [CSNK1D, DDB1, NONO, POLD1, TOP2A] |
| • Spindle organization [AURKA, BUB1B, CKS2, TTK, TUBG1, ZWINT] |
| • Cellular protein localization [AP1B1, ARF5, ICMT, NUP62, TIMM13, TOM1L1] |
| • DNA repair [APEX1, CSNK1D, DDB1, HMGB2, PARP1, POLB, POLD1, TOP2A] |
| • Glycosylation [ALG6, FUT2, GALNT10, MGAT4B, OGT, STT3A, UGCGL1] |
| Cluster 3 |
| • Enzyme linked receptor protein signaling pathway [ADRB2, ARRB2, SPTBN1, TEK, TGFBR3, ZFP106] |
| • Metal ion homeostasis [AGTR1, C5AR1, RGN, S1PR4, TRPC6] |
| • Protein amino acid phosphorylation [AATK, ADRB2, TIE1, TTN, ULK2] |
| • Transmembrane receptor protein tyrosine kinase signaling pathway [ADRB2, CRYAB, SORBS1, TEK, ZFP106] |
| • Negative regulation of cell proliferation [ADAMTS8, AIF1, TENC1, TGFB1I1, TOB2] |
| Cluster 4 |
| • Collagen fibril organization [COL3A1, COL5A2] |
| • Glycolysis [ALDOA, ENO1, GAPDH, PKM2, TPI1] |
| • Carbohydrate catabolic process [ALDOA, ENO1, FUCA1, GAPDH, TPI1] |
| • Extracellular structure organization [AGRN, COL11A1, COL3A1, COL5A2, ERBB2] |
| • Collagen metabolic process [COL3A1, MMP1, MMP11, MMP7] |
| • Negative regulation of cell adhesion [CDKN2A, HNRNPAB, TGFBI] |
| • Generation of precursor metabolites and energy [ALDOA, ENO1, GAPDH, UQCRH, UQCRQ] |
| • Negative regulation of protein ubiquitination [PSMA1, PSMA2, PSMB2, PSMB3, PSMB4, PSMB5] |
Figure 8. GeneGo disease association analysis. The enriched terms are sorted by p-value. The p-value is estimated by the GeneGo MetaCore ‘Statistically significant Diseases’ method. Yellow: cluster 1, blue: cluster 2, red: cluster 3, and green: cluster 4. Selected genes of the GO clusters are described in Table 2. Cluster 1 (top) includes genes of Inflammatory response, Regulation of cytoskeleton organization, Positive regulation of response to stimulus, Wound healing, Response to mechanical stimulus, Negative regulation of cell proliferation, Regulation of locomotion, Complement activation, Positive regulation of immune response, and Hormonal immune response. These genes are suppressed in AC vs AT. Cluster 2 includes the genes of M phase of mitotic cell cycle, Negative regulation of intracellular transport, Microtubule-based process, Response to DNA damage stimulus, Spindle organization, Cellular protein localization, DNA repair, and Glycosylation. These genes are overexpressed in AC vs AT. Cluster 3 includes the genes of Enzyme linked receptor protein signalling pathway, Metal ion homeostasis, Protein amino acid phosphorylation, Transmembrane receptor protein tyrosine kinase signalling pathway, and Negative regulation of cell proliferation. These genes are suppressed in AC vs AT. Cluster 4 includes the genes of Collagen fibril organization, Collagen metabolic process, Glycolysis, Carbohydrate catabolic process, Extracellular structure organization, Negative regulation of cell adhesion, Generation of precursor metabolites and energy, and Negative regulation of protein ubiquitination. These genes are up-regulated in AC vs AT (Figure 7).
Figure 9. GeneGo metabolic networks analysis. The enriched terms are sorted by p-value. The p-value is estimated by the GeneGo MetaCore 'Statistically Significant Networks' method (see Figure 7 for details).
Enriched UP_SEQ_FEATURE terms in the common and unique subsets of the PT-AT ECD-derived signature and the lung AC meta-signature.
| UP_SEQ_FEATURE | Count | % | P-Value | Fold enrichment | Benjamini | FDR |
|---|---|---|---|---|---|---|
| nucleotide phosphate-binding region:ATP | 59 | 13.3 | 1.60E-11 | 2.7 | 2.60E-08 | 2.70E-08 |
| mutagenesis site | 95 | 21.5 | 2.90E-11 | 2 | 2.40E-08 | 4.90E-08 |
| binding site: ATP | 36 | 8.1 | 4.20E-08 | 2.9 | 2.30E-05 | 7.00E-05 |
| domain: kinesin-motor | 9 | 2 | 5.20E-06 | 9.1 | 2.10E-03 | 8.70E-03 |
| signal peptide | 112 | 25.3 | 5.80E-06 | 1.5 | 1.90E-03 | 9.80E-03 |
| domain: MCM | 5 | 1.1 | 1.80E-05 | 27.1 | 4.80E-03 | 3.00E-02 |
| sequence variant | 317 | 71.7 | 2.30E-05 | 1.1 | 5.30E-03 | 3.80E-02 |
| domain: protein kinase | 27 | 6.1 | 3.50E-05 | 2.5 | 7.10E-03 | 5.90E-02 |
| active site: proton acceptor | 33 | 7.5 | 5.80E-05 | 2.2 | 1.00E-02 | 9.70E-02 |
| mutagenesis site | 271 | 15.4 | 3.70E-11 | 1.5 | 1.70E-07 | 6.90E-08 |
| cross-link: Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in ubiquitin) | 38 | 2.2 | 2.00E-05 | 2.1 | 4.40E-02 | 3.70E-02 |
| binding site:substrate | 50 | 2.8 | 2.90E-05 | 1.9 | 4.30E-02 | 5.40E-02 |
| domain: protein kinase | 66 | 8.3 | 4.00E-18 | 3.4 | 9.90E-15 | 7.00E-15 |
| nucleotide phosphate-binding region: ATP | 99 | 12.4 | 3.20E-17 | 2.5 | 4.00E-14 | 5.70E-14 |
| binding site: ATP | 69 | 8.7 | 1.20E-16 | 3.1 | 9.20E-14 | 2.00E-13 |
| mutagenesis site | 160 | 20.1 | 6.90E-16 | 1.9 | 4.20E-13 | 1.20E-12 |
| active site: proton acceptor | 71 | 8.9 | 2.60E-13 | 2.6 | 1.30E-10 | 4.60E-10 |
| nucleotide phosphate-binding region: GTP | 28 | 3.5 | 9.10E-05 | 2.3 | 3.70E-02 | 1.60E-01 |
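For readers unfamiliar with the multiple-testing columns, the following is a minimal sketch of the Benjamini-Hochberg step-up adjustment, on the assumption that the "Benjamini" column denotes this procedure. It is not expected to reproduce the tabulated values, which depend on the full list of terms tested by the annotation tool; the function name and the example p-values are illustrative only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four terms.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values non-decreasing in the raw p-value.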
Statistics of ‘mutagenesis site’ term in the common and the unique protein subsets found in the AC meta-signature and AC ECD signature
| Subsets | # Reported genes | # Protein IDs | # Mutagenesis site | # Mutagenesis site / # protein IDs |
|---|---|---|---|---|
| Common genes | 455 | 499 | 98 | 0.20 |
| AC meta-signature only | 829 | 813 | 162 | 0.20 |
| ECD signature only | 1852 | 1664 | 258 | 0.16 |
| Total | 3136 | 2926 | 518 | 0.19 |
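The per-subset values in the last column are consistent with the number of proteins annotated with a mutagenesis site divided by the number of protein IDs, as the short check below shows. The counts are copied from the table; the dictionary and function names are illustrative, and the "Total" row is left out of the check.

```python
# subset name -> (number of protein IDs, number with a 'mutagenesis site' annotation)
SUBSETS = {
    "Common genes": (499, 98),
    "AC meta-signature only": (813, 162),
    "ECD signature only": (1664, 258),
}


def mutagenesis_fraction(protein_ids: int, mutagenesis_sites: int) -> float:
    """Fraction of protein IDs that carry a 'mutagenesis site' annotation."""
    return mutagenesis_sites / protein_ids


fractions = {
    name: round(mutagenesis_fraction(ids, sites), 2)
    for name, (ids, sites) in SUBSETS.items()
}
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")
```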
The PT-AT discriminative set and its extended set are significantly enriched in genes reported as potential biomarkers for the diagnosis of early-stage lung AC. The extended set was obtained by relaxing the misclassification error threshold of the default PT-AT ECD signature to allow up to 3 errors. For more details, see Additional file 4.
| Early AC diagnostic signature genes from [ | Supporting references# | PT-AT extreme discriminative sign. (2829 genes) | Extended set (misclassification errors to 3) | EDGE |
|---|---|---|---|---|
| ATP10B | [ | YES | YES | NO |
| AURKA | [ | YES | YES | YES |
| CLDN5 | [ | YES | YES | YES |
| COL11A1 | [ | YES | YES | YES |
| DNAI2 | [ | NO | YES | NO |
| FABP6 | [ | NO | NO | NO |
| HIGD1B | [ | YES | YES | YES |
| ILF3 | [ | YES | YES | NO |
| IQCG | [ | NO | NO | NO |
| LRRC48 | [ | NO | NO | NO |
| LRRC50 | [ | NO | NO | NO |
| MCM6 | [ | YES | YES | NO |
| MUC4 | [ | NO | YES | NO |
| RARRES2 | [ | YES | YES | YES |
| RFTN1 | [ | YES | YES | NO |
| SCG5 | [ | YES | YES | NO |
| SCGB1A1 | [ | NO | YES | YES |
| SFTPA2 | [ | YES | YES | NO |
| SFTPB | [ | NO | YES | NO |
| SFTPC | [ | YES | YES | YES |
| TFPI2 | [ | YES | YES | YES |
| TM4SF4 | [ | NO | YES | NO |
| TOP2A | [ | YES | YES | YES |
| XAGE1A | [ | YES | YES | YES |
| XAGE1B | [ | no U133A probe sets | no U133A probe sets | no U133A probe sets |
| XAGE1C | [ | no U133A probe sets | no U133A probe sets | no U133A probe sets |
| XAGE1D | [ | no U133A probe sets | no U133A probe sets | no U133A probe sets |
| XAGE1E | [ | no U133A probe sets | no U133A probe sets | no U133A probe sets |
# - The references mentioned in this table correspond to the citations used in Additional file 4.
Figure 10. QRT-PCR validation of potential lung AC diagnostic biomarkers identified using MWT. The separability of the normal-lung AC pairs on the lung tissue QRT-PCR array is illustrated for potential lung AC gene markers identified using MWT, before (A & C) and after (B & D) application of the cross-normalization procedure, for SPP1 (A & B) and CENPA (C & D).
QRT-PCR validation of potential lung AC diagnostic biomarkers identified using PT-AT ECD.
| RefSeq gene (probe sets/primers) | WT Dis. error | WT P-value | MWT Dis. error | MWT P-value | Sample pairs | Fold change (AC/N) |
|---|---|---|---|---|---|---|
| SPP1 (209875_s_at) | 1 | 6.28E-06 | 0 | 1.25E-05 | 28 | 16.90 |
| CENPA (204962_s_at) | 4 | 9.00E-05 | 0 | 1.25E-05 | 28 | 1.41 |
| SPP1 (primer) | 1 | 3.03E-05 | 0 | 4.07E-05 | 24 | 28.91 |
| CENPA (primer) | 1 | 2.35E-05 | 0 | 4.07E-05 | 24 | 15.25 |
Discrimination error (Dis. error) is the number of misclassified pairs. Fold changes for the microarray data were estimated as the ratio of the mean expression values of AC tissue versus adjacent normal tissue in the same tumour sample. Fold changes for QRT-PCR were estimated as the ratio of the mean CT values (normalized using Actin-B CT values as control) of lung cancer tissues versus adjacent normal tissues in the same tumour sample. For more details of the expression and QRT-PCR data normalization procedures, please refer to the materials and methods section.
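The two quantities defined in this footnote can be sketched for a single gene as follows. The fold change follows the stated ratio of means. The rule for calling a pair "misclassified" is an assumption made for illustration: a pair counts as an error when its tumour-minus-normal difference does not follow the majority direction across pairs. It is not the paper's MWT or cross-normalization procedure, and the function names and toy values are hypothetical.

```python
from statistics import mean


def fold_change(tumour, normal):
    """Ratio of the mean tumour value to the mean adjacent-normal value."""
    return mean(tumour) / mean(normal)


def discrimination_error(tumour, normal):
    """Number of pairs whose direction of change disagrees with the majority.

    Assumption: ties (zero difference) are counted as errors.
    """
    diffs = [t - n for t, n in zip(tumour, normal)]
    n_up = sum(d > 0 for d in diffs)
    n_down = sum(d < 0 for d in diffs)
    if n_up >= n_down:
        # Gene called up-regulated: pairs that are not higher in tumour are errors.
        return sum(d <= 0 for d in diffs)
    # Gene called down-regulated: pairs that are not lower in tumour are errors.
    return sum(d >= 0 for d in diffs)


# Hypothetical expression values for four tumour/adjacent-normal pairs.
tumour = [8.0, 6.0, 9.0, 3.0]
normal = [2.0, 3.0, 3.0, 4.0]
print(discrimination_error(tumour, normal), fold_change(tumour, normal))
```

A gene with a discrimination error of 0 separates every tumour sample from its own adjacent tissue, which is the criterion the MWT columns of the table satisfy for both SPP1 and CENPA.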