François Fauteux, Anuradha Surendra, Scott McComb, Youlian Pan, Jennifer J Hill.
Abstract
Classification of tumors into subtypes can inform personalized approaches to treatment including the choice of targeted therapies. The two most common lung cancer histological subtypes, lung adenocarcinoma and lung squamous cell carcinoma, have been previously divided into transcriptional subtypes using microarray data, and corresponding signatures were subsequently used to classify RNA-seq data. Cross-platform unsupervised classification facilitates the identification of robust transcriptional subtypes by combining vast amounts of publicly available microarray and RNA-seq data. However, cross-platform classification is challenging because of intrinsic differences in data generated using the two gene expression profiling technologies. In this report, we show that robust gene expression subtypes can be identified in integrated data representing over 3500 normal and tumor lung samples profiled using two widely used platforms, Affymetrix HG-U133 Plus 2.0 Array and Illumina HiSeq RNA sequencing. We tested and analyzed consensus clustering for 384 combinations of data processing methods. The agreement between subtypes identified in single-platform and cross-platform normalized data was then evaluated using a variety of statistics. Results show that unsupervised learning can be achieved with combined microarray and RNA-seq data using selected preprocessing, cross-platform normalization, and unsupervised feature selection methods. Our analysis confirmed three lung adenocarcinoma transcriptional subtypes, but only two consistent subtypes in squamous cell carcinoma, as opposed to four subtypes previously identified. Further analysis showed that tumor subtypes were associated with distinct patterns of genomic alterations in genes coding for therapeutic targets. Importantly, by integrating quantitative proteomics data, we were able to identify tumor subtype biomarkers that effectively classify samples on the basis of both gene and protein expression. 
This study provides the basis for further integrative data analysis across gene and protein expression profiling platforms.
Year: 2021 PMID: 33888829 PMCID: PMC8062554 DOI: 10.1038/s41598-021-88209-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Overview of the workflow for selecting the best combination of data processing methods for cross-platform classification of lung cancer into tumor subtypes. In brief, microarray data were pre-processed using two methods, and RNA-seq data were pre-processed using six methods. Data from the two platforms were combined and cross-platform normalization was performed using four methods. After filtering data by removing samples with low confidence regarding main class labels (LUAD, LUSC and normal lung), single-platform and cross-platform normalized data were submitted to unsupervised feature selection using eight methods, and then to consensus clustering. Clustering results were compared between single-platform and cross-platform normalized data using various statistics. Clustering results from the top-ranking combination of data processing methods were selected for a final round of supervised classification into tumor subtypes. Microarray pre-processing methods: mas5, microarray suite 5.0; rma, robust multi-array average. RNA-seq pre-processing methods: cpm, counts per million; fpkm, fragments per kilobase per million; normTransform, shifted logarithm transformation; rpkm, reads per kilobase per million; voom, variance modeling at the observational level; vst, variance stabilizing transformation. Cross-platform normalization methods: ComBat, empirical Bayes batch effect correction; FSQN, feature-specific quantile normalization; quantile, quantile normalization; TDM, training distribution matching. Unsupervised feature selection methods: disr, diversity-induced self-representation; lscore, Laplacian score; mad, median absolute deviation; mcfs, multi-cluster feature selection; specu, unsupervised spectral feature selection; spufs, structure preserving unsupervised feature selection; svde, singular value decomposition entropy; udfs, unsupervised discriminative feature selection.
Clustering statistics: entropy, platform entropy; ITCC, information theoretic clustering comparison; min(O/E), minimum observed to expected ratio; NbClust, optimal number of clusters; purity, maximum agreement between single and cross-platform data; randomness, platform randomness within clusters.
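
The 384 combinations mentioned in the abstract follow from the method counts in this caption: 2 microarray pre-processing methods × 6 RNA-seq pre-processing methods × 4 cross-platform normalization methods × 8 unsupervised feature selection methods. The short Python sketch below enumerates that grid. It is only an illustration of the arithmetic, not the authors' pipeline; the variable and function names are assumptions made for this example, and only the method abbreviations come from the caption.

```python
from itertools import product

# Method abbreviations copied from the Figure 1 caption.
MICROARRAY_PREPROCESSING = ["mas5", "rma"]
RNASEQ_PREPROCESSING = ["cpm", "fpkm", "normTransform", "rpkm", "voom", "vst"]
CROSS_PLATFORM_NORMALIZATION = ["ComBat", "FSQN", "quantile", "TDM"]
FEATURE_SELECTION = ["disr", "lscore", "mad", "mcfs",
                     "specu", "spufs", "svde", "udfs"]


def enumerate_combinations():
    """Return every (microarray, RNA-seq, normalization, selection) tuple."""
    return list(product(MICROARRAY_PREPROCESSING,
                        RNASEQ_PREPROCESSING,
                        CROSS_PLATFORM_NORMALIZATION,
                        FEATURE_SELECTION))


combinations = enumerate_combinations()
print(len(combinations))  # 2 * 6 * 4 * 8 = 384
```

Each tuple identifies one candidate processing chain whose clustering result can then be scored with the statistics listed in the caption.
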
Figure 2. Heatmap and hierarchical clustering of lung cancer and normal lung microarray and RNA-seq data. LNOR, normal (healthy) lung; LUAD-(1–3), lung adenocarcinoma subtypes 1–3; LUSC-(1–2), lung squamous cell carcinoma subtypes 1–2. This figure was produced using R version 4.0.4 (https://www.r-project.org/).
Focal amplifications and therapeutic targets in LUAD and LUSC subtypes.
| Class | Chromosome | Start | End | Gene |
|---|---|---|---|---|
| LUAD-1 | chr4 | 54249594 | 58387240 | KDR |
| LUAD-1 | chr7 | 54467979 | 56385413 | EGFR |
| LUAD-1 | chr8 | 38413296 | 38619413 | FGFR1 |
| LUAD-1 | chr12 | 25205851 | 25213599 | KRAS |
| LUAD-1 | chr17 | 39507734 | 39854986 | ERBB2 |
| LUAD-2 | chr4 | 54481387 | 55668902 | KDR |
| LUAD-2 | chr7 | 116699055 | 116705489 | MET |
| LUAD-2 | chr7 | 54714092 | 55576700 | EGFR |
| LUAD-2 | chr12 | 25181421 | 25209325 | KRAS |
| LUAD-2 | chr17 | 39725021 | 39761258 | ERBB2 |
| LUSC-1 | chr7 | 54751453 | 55698753 | EGFR |
| LUSC-2 | chr7 | 54699947 | 55357446 | EGFR |
Percentage of samples carrying somatic mutations in therapeutic targets in LUAD and LUSC subtypes.
| Gene | LUAD-1 | LUAD-2 | LUAD-3 | LUSC-1 | LUSC-2 |
|---|---|---|---|---|---|
| KRAS | 27.25 | 24.15 | 24.55 | 0.96 | 0 |
| EGFR | 3.89 | 9.12 | 12.93 | 2.03 | 1.13 |
| NTRK3 | 11.48 | 9.47 | 2.64 | 4.06 | 3.1 |
| KDR | 10.04 | 7.33 | 4.59 | 6.3 | 4.66 |
| PDGFRA | 7.38 | 8.76 | 4.32 | 4.16 | 2.83 |
| BRAF | 7.38 | 6.8 | 6.41 | 1.5 | 2.68 |
| ROS1 | 7.38 | 3.58 | 0.7 | 5.66 | 7.06 |
| ALK | 6.56 | 5.18 | 3.34 | 3.21 | 1.97 |
| NTRK2 | 5.94 | 2.68 | 2.09 | 2.46 | 1.69 |
| RET | 4.92 | 4.65 | 0.69 | 2.67 | 3.81 |
| PDGFRB | 4.3 | 2.68 | 1.25 | 2.14 | 1.41 |
| MET | 2.46 | 4.11 | 2.36 | 0.54 | 0.28 |
| NTRK1 | 3.28 | 1.79 | 1.25 | 1.6 | 2.96 |
| ERBB2 | 0 | 3.22 | 0.56 | 1.39 | 1.13 |
| FGFR1 | 0.82 | 0.53 | 0.28 | 0.74 | 0 |
Figure 3. Biomarkers selected across three platforms for one-against-one classification of LUAD subtypes and normal lung. (A) Normal lung vs. LUAD-1; (B) normal lung vs. LUAD-2; (C) normal lung vs. LUAD-3; (D) LUAD-1 vs. LUAD-2; (E) LUAD-1 vs. LUAD-3; (F) LUAD-2 vs. LUAD-3. HGNC gene symbols are used to identify all biomarkers. This figure was produced using R version 4.0.4 (https://www.r-project.org/).
Functions used for preprocessing/normalization of microarray and RNA-seq data.
| Library | Function | Description | Reference |
|---|---|---|---|
| affy | mas5 | Microarray suite 5.0 | [ |
| affy | rma | Robust multi-array average | [ |
| edgeR | cpm | Counts per million | [ |
| edgeR | rpkm | Reads per kilobase per million | [ |
| DESeq2 | normTransform | Shifted logarithm transformation | [ |
| DESeq2 | vst | Variance stabilizing transformation | [ |
| DESeq2 | fpkm | Fragments per kilobase per million | [ |
| limma | voom | Variance modeling at the observational level | [ |
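
As a rough illustration of the simplest RNA-seq transformation in this table, counts per million rescales each raw count by the sample's library size (total count) and multiplies by one million. The Python sketch below shows only that basic formula; it does not reproduce the edgeR `cpm` function or its options, and the function name and toy counts are hypothetical.

```python
def counts_per_million(sample_counts):
    """Scale the raw read counts of one sample to counts per million.

    sample_counts: raw counts, one entry per gene, for a single sample.
    """
    library_size = sum(sample_counts)
    if library_size == 0:
        raise ValueError("library size is zero; CPM is undefined")
    return [count * 1_000_000 / library_size for count in sample_counts]


# Hypothetical toy sample with three genes and 100 reads in total.
toy_counts = [10, 30, 60]
toy_cpm = counts_per_million(toy_counts)
print(toy_cpm)
```

After this scaling, the values of each sample sum to one million, which removes differences in sequencing depth between samples but not differences in gene length.
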
Functions used for cross-platform normalization of microarray and RNA-seq data.
| Library | Function | Description | Reference |
|---|---|---|---|
| FSQN | quantileNormalizeByFeature | Feature-specific quantile normalization | [ |
| TDM | tdm_transform | Training distribution matching | [ |
| sva | ComBat | Empirical Bayes batch effect correction | [ |
| preprocessCore | normalize.quantiles | Quantile normalization | [ |
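
To make the idea behind plain quantile normalization concrete, the Python sketch below forces every sample onto a shared reference distribution: values are ranked within each sample, the values sharing a rank are averaged across samples, and each value is replaced by the average for its rank. This is a simplified sketch, not the `normalize.quantiles` implementation; in particular, ties are broken by position rather than averaged, and the function name and toy values are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (each a list of gene values).

    All samples must list the same genes in the same order.
    """
    n_genes = len(samples[0])
    if any(len(sample) != n_genes for sample in samples):
        raise ValueError("all samples must have the same number of genes")

    # Reference distribution: mean across samples of the values at each rank.
    sorted_samples = [sorted(sample) for sample in samples]
    reference = [
        sum(sample[rank] for sample in sorted_samples) / len(samples)
        for rank in range(n_genes)
    ]

    normalized = []
    for sample in samples:
        # Gene indices ordered from the lowest to the highest value.
        order = sorted(range(n_genes), key=lambda gene: sample[gene])
        result = [0.0] * n_genes
        for rank, gene in enumerate(order):
            result[gene] = reference[rank]
        normalized.append(result)
    return normalized


# Hypothetical toy data: one microarray-like and one RNA-seq-like sample.
toy_samples = [[5.0, 2.0, 3.0], [40.0, 10.0, 60.0]]
toy_normalized = quantile_normalize(toy_samples)
print(toy_normalized)
```

After normalization both toy samples contain exactly the same set of values, and only the assignment of values to genes differs, which is what makes the two distributions directly comparable.
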
Unsupervised feature selection methods used for clustering analysis.
| Library | Function | Description | Reference |
|---|---|---|---|
| stats | mad | Median absolute deviation | [ |
| | svde | Singular value decomposition entropy | [ |
| Rdimtools | do.disr | Diversity-induced self-representation | [ |
| Rdimtools | do.lscore | Laplacian score | [ |
| Rdimtools | do.mcfs | Multi-cluster feature selection | [ |
| Rdimtools | do.specu | Unsupervised spectral feature selection | [ |
| Rdimtools | do.spufs | Structure preserving unsupervised feature selection | [ |
| Rdimtools | do.udfs | Unsupervised discriminative feature selection | [ |
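
Of the methods in this table, median absolute deviation is the easiest to illustrate: each gene is scored by the variability of its expression across samples, and the highest-scoring genes are kept for clustering. The Python sketch below assumes the default scaling constant of R's `mad` (1.4826); the function names, gene labels, and the number of features retained are hypothetical and do not reflect the settings used in the study.

```python
from statistics import median

# Default scaling constant of R's stats::mad, which makes the estimate
# consistent with the standard deviation for normally distributed data.
MAD_SCALE = 1.4826


def median_absolute_deviation(values):
    """Scaled median of absolute deviations from the median."""
    center = median(values)
    return MAD_SCALE * median([abs(value - center) for value in values])


def select_top_features_by_mad(expression, n_features):
    """Return the n_features gene names with the largest MAD.

    expression: dict mapping gene name -> list of values across samples.
    """
    scores = {gene: median_absolute_deviation(values)
              for gene, values in expression.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_features]


# Hypothetical toy expression matrix: three genes, four samples.
toy_expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],   # constant, MAD = 0
    "GENE_B": [1.0, 5.0, 9.0, 13.0],  # most variable
    "GENE_C": [2.0, 3.0, 4.0, 5.0],
}
selected = select_top_features_by_mad(toy_expression, n_features=2)
print(selected)
```

Because the score uses no class labels, the selection is unsupervised: constant genes such as the toy `GENE_A` are dropped without any reference to tumor subtype.
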