Su Bin Lim, Swee Jin Tan, Wan-Teck Lim, Chwee Teck Lim.
Abstract
The Gene Expression Omnibus (GEO) database is an excellent public source of whole transcriptomic profiles of multiple cancers. The main challenge is the limited accessibility of such large-scale genomic data to people without a background in bioinformatics or computer science. This presents difficulties in data analysis, sharing and visualization. Here, we present an integrated bioinformatics pipeline and a normalized dataset that has been preprocessed using a robust statistical methodology, allowing others to perform large-scale meta-analysis without having to conduct time-consuming data mining and statistical correction. Comprising 1,118 patient-derived samples, the normalized dataset includes primary non-small cell lung cancer (NSCLC) tumors and paired normal lung tissues from ten independent GEO datasets, facilitating differential expression analysis. The data have been merged, normalized, batch effect-corrected and filtered for genes with low variance via multiple open source R packages integrated into our workflow. Overall, this dataset (with associated clinical metadata) better represents the diseased population and serves as a powerful tool for early predictive biomarker discovery.
Year: 2018 PMID: 30040079 PMCID: PMC6057440 DOI: 10.1038/sdata.2018.136
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Study design.
Raw data from ten independent datasets were preprocessed for normalization, background correction and probe-to-gene mapping. The fRMA-normalized data were corrected for batch effect using the ComBat method and filtered for genes with low variance across samples. The dataset was validated with PCA and with similarity measurement against RNA-Seq-profiled samples. The statistical R packages used to develop this dataset are stated.
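The two steps that follow normalization in the study design (batch correction, then low-variance filtering) can be illustrated with a minimal Python sketch. It is not the authors' R workflow: per-batch mean centering is used here as a deliberately crude stand-in for ComBat, which additionally models batch-specific variance and can protect known covariates. The function names, the toy gene labels and the variance cutoff are all hypothetical.

```python
from statistics import mean, pvariance


def center_by_batch(expr, batches):
    """Remove per-batch mean offsets, gene by gene.

    NOTE: a simplified stand-in for ComBat, not ComBat itself.
    expr:    dict mapping gene -> list of log2 expression values (one per sample)
    batches: list of batch labels, one per sample, in the same order
    """
    corrected = {}
    for gene, values in expr.items():
        grand_mean = mean(values)
        batch_means = {
            b: mean([v for v, lab in zip(values, batches) if lab == b])
            for b in set(batches)
        }
        # Subtract the batch offset but keep the gene's overall level.
        corrected[gene] = [
            v - batch_means[lab] + grand_mean for v, lab in zip(values, batches)
        ]
    return corrected


def filter_low_variance(expr, min_variance):
    """Keep only genes whose variance across samples reaches the cutoff.

    min_variance is a hypothetical parameter; the source does not state one.
    """
    return {g: v for g, v in expr.items() if pvariance(v) >= min_variance}


# Toy example: two batches with a constant offset on GENE_A, a flat GENE_B.
toy_expr = {
    "GENE_A": [5.0, 5.2, 7.0, 7.2],
    "GENE_B": [3.0, 3.0, 3.0, 3.0],
}
toy_batches = ["batch1", "batch1", "batch2", "batch2"]
toy_corrected = center_by_batch(toy_expr, toy_batches)
toy_kept = filter_low_variance(toy_corrected, min_variance=0.005)
```

In the toy data the offset between the two batches disappears from GENE_A while its within-batch spread is preserved, and the flat GENE_B is dropped by the variance filter. Mean centering of this kind would also erase real biological differences when batches differ in phenotype composition, which is one reason a model-based method is preferable on real data.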
GSE accession number and number of samples for each phenotype.
| # | Dataset | Non-tumor lung tissue | NSCLC tumor | Microarray platform |
|---|---|---|---|---|
| 1 | GSE10799 | 3 | 16 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 2 | GSE12667 | 0 | 75 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 3 | GSE50081 | 0 | 181 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 4 | GSE31210 | 20 | 226 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 5 | GSE18842 | 45 | 46 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 6 | GSE10445 | 0 | 72 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 7 | GSE33356 | 60 | 60 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 8 | GSE19188 | 65 | 91 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 9 | GSE28571 | 0 | 100 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 10 | GSE10245 | 0 | 58 | Affymetrix Human Genome U133 Plus 2.0 Array |
| | Total | 193 | 925 | 1,118 samples in total |
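As a quick consistency check, the per-dataset counts can be tallied in a few lines of Python. The sketch reads the two count columns as non-tumor and tumor samples, an interpretation that matches the 193 non-tumors and 925 tumors quoted in the Figure 2 caption; the variable and function names are illustrative only.

```python
# (GEO accession, non-tumor samples, tumor samples), copied from the table.
DATASETS = [
    ("GSE10799", 3, 16),
    ("GSE12667", 0, 75),
    ("GSE50081", 0, 181),
    ("GSE31210", 20, 226),
    ("GSE18842", 45, 46),
    ("GSE10445", 0, 72),
    ("GSE33356", 60, 60),
    ("GSE19188", 65, 91),
    ("GSE28571", 0, 100),
    ("GSE10245", 0, 58),
]


def tally(datasets):
    """Return (non-tumor total, tumor total, overall total)."""
    non_tumor = sum(n for _, n, _ in datasets)
    tumor = sum(t for _, _, t in datasets)
    return non_tumor, tumor, non_tumor + tumor


# Datasets that contribute non-tumor tissue as well as tumors.
with_non_tumor = [acc for acc, n, _ in DATASETS if n > 0]

non_tumor_total, tumor_total, overall_total = tally(DATASETS)
```

The totals come out to 193, 925 and 1,118, in agreement with the table's last row, and five of the ten datasets contribute non-tumor tissue.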
Figure 2. Validity of our generated dataset.
(a) The effect of batch effect removal is demonstrated using the plotMDS function. (b) The MDS plot of our merged microarray dataset shows a clear separation between disease phenotypes (925 primary NSCLC tumors: red; 193 non-tumors: green). (c) The merging effect of the ComBat technique on the fRMA-normalized data is illustrated using the plotRLE function. (d) The local effect of the ComBat method at the gene level is demonstrated using the plotGeneWiseBoxPlot function. The A1BG gene was selected for demonstration purposes.
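For readers unfamiliar with relative log expression (RLE) plots such as the one in panel (c), the underlying quantity can be sketched as follows. This shows the general definition of RLE (each gene's log expression minus that gene's median across samples), not the internals of the plotRLE function the authors used; the function names and toy values are hypothetical.

```python
from statistics import median


def relative_log_expression(expr):
    """Compute RLE deviations grouped by sample.

    expr: dict mapping gene -> list of log2 expression values (one per sample)
    Returns a list with one entry per sample; each entry holds that sample's
    deviations from the per-gene medians.
    """
    n_samples = len(next(iter(expr.values())))
    per_sample = [[] for _ in range(n_samples)]
    for values in expr.values():
        gene_median = median(values)
        for i, v in enumerate(values):
            per_sample[i].append(v - gene_median)
    return per_sample


def rle_medians(expr):
    """Median RLE per sample; values far from zero flag a shifted sample."""
    return [median(devs) for devs in relative_log_expression(expr)]


# Toy example: sample 0 is shifted down and sample 2 up by one log2 unit.
toy_expr = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [4.0, 5.0, 6.0],
    "g3": [2.0, 3.0, 4.0],
}
toy_medians = rle_medians(toy_expr)
```

An RLE plot draws one boxplot per sample from these deviations. After successful normalization and batch correction the boxes are expected to be centered near zero with similar spread, whereas the toy samples above sit at -1, 0 and +1.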
Figure 3. The interplatform concordance between microarray (normalized dataset) and RNA-Seq (TCGA) platforms in discovering DE genes for distinct subtypes of NSCLC.
(a) Linear regression lines (black) and marginal histograms (blue) are drawn; rs = Spearman’s correlation coefficient. (b) DEG lists generated for adenocarcinoma and squamous cell carcinoma (SCC). Thresholds of logFC >1.5 and logFC >3 were used as the criteria to define DE genes for our normalized dataset and the TCGA cohorts, respectively.
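The two quantities in this caption, Spearman's correlation between platforms and a logFC cutoff for calling DE genes, can be sketched in plain Python. Two assumptions are made here that the caption does not confirm: the cutoff is applied to the absolute logFC (so both up- and down-regulated genes are kept), and ties receive average ranks. All function names and gene labels are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def de_genes(log_fc, threshold):
    """Genes whose |logFC| exceeds the threshold (absolute value is an assumption)."""
    return {gene for gene, fc in log_fc.items() if abs(fc) > threshold}


# Toy logFC values for the same genes on two platforms (invented numbers).
microarray_fc = {"GENE_A": 2.0, "GENE_B": -1.8, "GENE_C": 0.4, "GENE_D": 1.6}
rnaseq_fc = {"GENE_A": 4.1, "GENE_B": -3.5, "GENE_C": 0.9, "GENE_D": 2.7}

genes = sorted(microarray_fc)
toy_rs = spearman([microarray_fc[g] for g in genes], [rnaseq_fc[g] for g in genes])
shared_de = de_genes(microarray_fc, 1.5) & de_genes(rnaseq_fc, 3)
```

Using different cutoffs per platform, as the caption describes, reflects the wider dynamic range of RNA-Seq fold changes; intersecting the two gene sets gives the DE genes supported by both platforms.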