Daniel Castillo, Juan Manuel Gálvez, Luis Javier Herrera, Belén San Román, Fernando Rojas, Ignacio Rojas.
Abstract
BACKGROUND: Many public repositories now hold large microarray gene expression datasets. Microarray technology, however, is less powerful and less accurate than more recent Next Generation Sequencing technologies such as RNA-Seq. Even so, information from microarrays is reliable and robust, so it can be exploited by integrating microarray data with RNA-Seq data. In addition, extracting information from, and acquiring, a large number of RNA-Seq samples still entails very high costs in time and computational resources. This paper proposes a new model to find the gene signature of breast cancer cell lines by integrating heterogeneous data from different breast cancer datasets obtained with microarray and RNA-Seq technologies. Data integration is expected to lend more robust statistical significance to the results. Finally, a classification method is proposed to test the robustness of the Differentially Expressed Genes when unseen data is presented for diagnosis.
Keywords: Breast cancer; Cancer; Classification; Gene expression; Integration; Microarray; RNA-Seq; Random Forest; SVM; k-NN
Year: 2017 PMID: 29157215 PMCID: PMC5697344 DOI: 10.1186/s12859-017-1925-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Description of the training and test series considered with number of samples/outliers
| TRAINING SERIES | | | | | |
|---|---|---|---|---|---|
| Series | Platform | Technology | Quality samples | Excluded outliers | Samples origin |
| GSE52712 | Affymetrix | Microarray | 19 | 1 | Manchester (UK) |
| GSE40987 | Affymetrix | Microarray | 10 | 0 | Boston (USA) |
| GSE52262 | Affymetrix | Microarray | 16 | 0 | Houston (USA) |
| GSE12790 | Affymetrix | Microarray | 20 | 1 | San Francisco (USA) |
| GSE46834 | Illumina | Microarray | 8 | 0 | New York (USA) |
| GSE68651 | Illumina | Microarray | 35 | 1 | Southampton (UK) |
| GSE74251 | Illumina | RNA-Seq | 12 | 0 | Philadelphia (USA) |
| GSE74377 | Illumina | RNA-Seq | 12 | 0 | Iowa (USA) |
| TOTAL | Integrated | | 132 | 3 | |
| TEST SERIES | | | | | |
| Series | Platform | Technology | Quality samples | Excluded outliers | Samples origin |
| GSE78011 | Illumina | RNA-Seq | 3 | 0 | Louisville (USA) |
| GSE81593 | Illumina | RNA-Seq | 3 | 0 | New York (USA) |
| GSE75292 | Illumina | Microarray | 6 | 1 | Goyang (South Korea) |
| GSE29327 | Affymetrix | Microarray | 6 | 0 | South San Francisco (USA) |
| GSE30931 | Illumina | Microarray | 12 | 0 | Goettingen (Germany) |
| GSE48398 | Illumina | Microarray | 36 | 0 | Texas (USA) |
| GSE35928 | Affymetrix | Microarray | 6 | 0 | Piscataway (USA) |
| GSE57339 | Illumina | Microarray | 12 | 0 | New Haven (USA) |
| GSE45715 | Illumina | Microarray | 42 | 0 | Miami (USA) |
| TOTAL | Integrated | | 126 | 1 | |
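
The TOTAL rows above can be checked against the per-series counts. The following is a minimal sketch that copies the sample and outlier counts from the table and tallies them overall and per technology; the names `TRAINING`, `TEST` and `summarize` are illustrative and do not come from the study.

```python
from collections import defaultdict

# (series, platform, technology, quality_samples, excluded_outliers),
# copied from the training/test table above.
TRAINING = [
    ("GSE52712", "Affymetrix", "Microarray", 19, 1),
    ("GSE40987", "Affymetrix", "Microarray", 10, 0),
    ("GSE52262", "Affymetrix", "Microarray", 16, 0),
    ("GSE12790", "Affymetrix", "Microarray", 20, 1),
    ("GSE46834", "Illumina", "Microarray", 8, 0),
    ("GSE68651", "Illumina", "Microarray", 35, 1),
    ("GSE74251", "Illumina", "RNA-Seq", 12, 0),
    ("GSE74377", "Illumina", "RNA-Seq", 12, 0),
]
TEST = [
    ("GSE78011", "Illumina", "RNA-Seq", 3, 0),
    ("GSE81593", "Illumina", "RNA-Seq", 3, 0),
    ("GSE75292", "Illumina", "Microarray", 6, 1),
    ("GSE29327", "Affymetrix", "Microarray", 6, 0),
    ("GSE30931", "Illumina", "Microarray", 12, 0),
    ("GSE48398", "Illumina", "Microarray", 36, 0),
    ("GSE35928", "Affymetrix", "Microarray", 6, 0),
    ("GSE57339", "Illumina", "Microarray", 12, 0),
    ("GSE45715", "Illumina", "Microarray", 42, 0),
]


def summarize(series):
    """Return (quality samples, excluded outliers, samples per technology)."""
    per_tech = defaultdict(int)
    for _, _, technology, samples, _ in series:
        per_tech[technology] += samples
    total_samples = sum(row[3] for row in series)
    total_outliers = sum(row[4] for row in series)
    return total_samples, total_outliers, dict(per_tech)


train_total, train_outliers, train_by_tech = summarize(TRAINING)
test_total, test_outliers, test_by_tech = summarize(TEST)
print("training:", train_total, train_outliers, train_by_tech)
print("test:", test_total, test_outliers, test_by_tech)
```

The per-technology split shows how strongly both sets are dominated by microarray samples, which is the imbalance the integration approach has to cope with.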
Fig. 1 Microarray gene expression pipeline
Fig. 2 RNA-Seq gene expression integration pipeline
Fig. 3 Integrated pipeline followed for this study
Fig. 4 Expression profile of training and test datasets before normalization
Fig. 5 Expression profile of training and test datasets after normalization
Fig. 6 Intersection of expressed genes in RNA-Seq, microarray and the integrated dataset
Fig. 7 Boxplot of gene expression values for the set of 98 expressed genes. The figure shows significant differences between expression values for the MCF7 and HS578T cancer cell lines and the MCF10A non-cancer cell line
List of 98 expressed genes obtained with limma as the intersection of microarray, RNA-Seq and integrated dataset
| Gene name | logFC | t | p-val | adj.p.val | B |
|---|---|---|---|---|---|
| KRT19 | 7.993 | 11.072 | 8.124E-21 | 2.449E-19 | 36.607 |
| KRT6A | -7.800 | -13.558 | 3.347E-27 | 2.503E-25 | 51.214 |
| NNMT | -7.584 | -11.544 | 4.951E-22 | 1.780E-20 | 39.384 |
| VIM | -7.261 | -15.117 | 3.917E-31 | 5.046E-29 | 60.213 |
| AKR1B1 | -6.943 | -11.437 | 9.357E-22 | 3.265E-20 | 38.753 |
| SFRP1 | -6.866 | -18.820 | 4.925E-40 | 1.904E-37 | 80.570 |
| TGFBI | -6.701 | -14.299 | 4.424E-29 | 4.174E-27 | 55.515 |
| MT1E | -6.650 | -15.281 | 1.537E-31 | 2.079E-29 | 61.142 |
| C3 | -6.569 | -15.928 | 3.857E-33 | 6.589E-31 | 64.805 |
| BMP7 | 6.406 | 13.058 | 6.330E-26 | 3.910E-24 | 48.292 |
| KRT5 | -6.229 | -9.125 | 7.460E-16 | 1.062E-14 | 25.273 |
| CXCL1 | -6.145 | -13.526 | 4.030E-27 | 2.986E-25 | 51.030 |
| S100A2 | -6.016 | -9.582 | 5.249E-17 | 9.014E-16 | 27.902 |
| KRT7 | -5.991 | -11.975 | 3.850E-23 | 1.643E-21 | 41.922 |
| TNS4 | -5.866 | -25.125 | 1.651E-53 | 3.829E-50 | 111.284 |
| EEF1A2 | 5.764 | 8.956 | 1.979E-15 | 2.656E-14 | 24.307 |
| CLMP | -5.631 | -11.238 | 3.037E-21 | 9.781E-20 | 37.583 |
| IFI16 | -5.543 | -9.230 | 4.073E-16 | 6.036E-15 | 25.872 |
| LAMC2 | -5.426 | -12.346 | 4.247E-24 | 2.015E-22 | 44.112 |
| IGFBP4 | 5.412 | 13.779 | 9.173E-28 | 7.406E-26 | 52.501 |
| FAM83A | -5.328 | -14.042 | 1.974E-28 | 1.741E-26 | 54.028 |
| SYTL2 | 5.283 | 11.883 | 6.617E-23 | 2.725E-21 | 41.384 |
| SNAI2 | -5.169 | -9.731 | 2.204E-17 | 4.010E-16 | 28.762 |
| DNER | -5.152 | -11.859 | 7.620E-23 | 3.114E-21 | 41.244 |
| PRKCDBP | -5.105 | -10.241 | 1.105E-18 | 2.434E-17 | 31.730 |
| ALOX15B | -5.088 | -16.524 | 1.353E-34 | 2.896E-32 | 68.133 |
| IGFBP5 | 5.085 | 8.165 | 1.755E-13 | 1.735E-12 | 19.871 |
| BNC1 | -5.072 | -16.335 | 3.889E-34 | 7.697E-32 | 67.085 |
| GFRA1 | 5.021 | 6.872 | 1.958E-10 | 1.223E-09 | 12.955 |
| DSC3 | -4.999 | -17.145 | 4.296E-36 | 1.181E-33 | 71.561 |
| PTGES | -4.990 | -17.489 | 6.479E-37 | 1.947E-34 | 73.440 |
| TFF1 | 4.925 | 4.857 | 3.168E-06 | 1.023E-05 | 3.497 |
| RAB25 | 4.864 | 8.521 | 2.368E-14 | 2.683E-13 | 21.851 |
| KRT14 | -4.863 | -6.445 | 1.768E-09 | 9.652E-09 | 10.794 |
| EFEMP1 | -4.855 | -10.020 | 4.059E-18 | 8.275E-17 | 30.440 |
| SLPI | -4.793 | -10.194 | 1.455E-18 | 3.128E-17 | 31.457 |
| SDPR | -4.728 | -12.002 | 3.264E-23 | 1.401E-21 | 42.086 |
| FBP1 | 4.707 | 6.789 | 3.017E-10 | 1.848E-09 | 12.530 |
| EPCAM | 4.662 | 8.150 | 1.906E-13 | 1.878E-12 | 19.790 |
| GNA15 | -4.570 | -15.676 | 1.614E-32 | 2.495E-30 | 63.382 |
| HTRA1 | -4.527 | -10.906 | 2.178E-20 | 6.152E-19 | 35.627 |
| RAC2 | -4.524 | -11.727 | 1.669E-22 | 6.433E-21 | 40.465 |
| CLCA2 | -4.411 | -9.272 | 3.189E-16 | 4.828E-15 | 26.115 |
| GPX1 | -4.384 | -6.773 | 3.281E-10 | 1.994E-09 | 12.448 |
| EMP3 | -4.383 | -9.299 | 2.728E-16 | 4.176E-15 | 26.269 |
| SERPINB5 | -4.371 | -8.314 | 7.600E-14 | 8.016E-13 | 20.698 |
| TSPYL5 | 4.317 | 6.297 | 3.735E-09 | 1.943E-08 | 10.062 |
| GSTP1 | -4.242 | -5.846 | 3.433E-08 | 1.523E-07 | 7.892 |
| SLC2A10 | 4.216 | 11.411 | 1.088E-21 | 3.782E-20 | 38.602 |
| LDHB | -4.182 | -5.892 | 2.745E-08 | 1.238E-07 | 8.111 |
| VSTM2L | -4.146 | -11.277 | 2.409E-21 | 7.852E-20 | 37.813 |
| BIRC3 | -4.079 | -13.064 | 6.110E-26 | 3.799E-24 | 48.327 |
| ABLIM3 | -4.000 | -12.337 | 4.481E-24 | 2.113E-22 | 44.059 |
| TFCP2L1 | -3.874 | -11.847 | 8.202E-23 | 3.344E-21 | 41.171 |
| DSG3 | -3.820 | -8.387 | 5.035E-14 | 5.469E-13 | 21.105 |
| SLC26A2 | -3.798 | -13.491 | 4.947E-27 | 3.632E-25 | 50.826 |
| C3orf14 | 3.763 | 7.772 | 1.558E-12 | 1.358E-11 | 17.715 |
| IL20RB | -3.667 | -8.868 | 3.262E-15 | 4.229E-14 | 23.812 |
| FXYD5 | -3.623 | -5.585 | 1.191E-07 | 4.882E-07 | 6.679 |
| GSTM3 | 3.590 | 9.622 | 4.161E-17 | 7.268E-16 | 28.133 |
| ADRB2 | -3.572 | -9.968 | 5.512E-18 | 1.099E-16 | 30.136 |
| EMP1 | -3.535 | -7.622 | 3.543E-12 | 2.907E-11 | 16.905 |
| IGFBP7 | -3.530 | -4.676 | 6.866E-06 | 2.104E-05 | 2.751 |
| GJB5 | -3.517 | -12.456 | 2.225E-24 | 1.097E-22 | 44.755 |
| HENMT1 | 3.514 | 7.953 | 5.732E-13 | 5.316E-12 | 18.702 |
| ZBED2 | -3.507 | -6.452 | 1.705E-09 | 9.338E-09 | 10.830 |
| MSLN | -3.504 | -8.558 | 1.917E-14 | 2.217E-13 | 22.061 |
| IL18 | -3.415 | -9.270 | 3.223E-16 | 4.864E-15 | 26.104 |
| TRIM29 | -3.395 | -9.588 | 5.081E-17 | 8.735E-16 | 27.934 |
| OSR2 | 3.346 | 8.380 | 5.238E-14 | 5.671E-13 | 21.066 |
| LAMB1 | -3.346 | -6.972 | 1.162E-10 | 7.510E-10 | 13.468 |
| UCP2 | 3.332 | 5.788 | 4.539E-08 | 1.979E-07 | 7.620 |
| CPVL | -3.331 | -7.870 | 9.043E-13 | 8.152E-12 | 18.253 |
| KRT81 | -3.320 | -5.133 | 9.424E-07 | 3.334E-06 | 4.670 |
| S100A8 | -3.292 | -5.698 | 6.982E-08 | 2.957E-07 | 7.200 |
| TP53I3 | -3.242 | -11.149 | 5.160E-21 | 1.589E-19 | 37.057 |
| FOXA1 | 3.226 | 5.576 | 1.241E-07 | 5.069E-07 | 6.640 |
| SLC24A3 | 3.211 | 6.190 | 6.356E-09 | 3.184E-08 | 9.541 |
| PNLIPRP3 | -3.200 | -7.998 | 4.470E-13 | 4.207E-12 | 18.948 |
| INHBB | 3.180 | 7.756 | 1.698E-12 | 1.468E-11 | 17.630 |
| RAB38 | -3.129 | -9.539 | 6.781E-17 | 1.137E-15 | 27.649 |
| ZBTB16 | -3.112 | -8.869 | 3.251E-15 | 4.217E-14 | 23.816 |
| PLD5 | -3.070 | -11.039 | 9.925E-21 | 2.960E-19 | 36.408 |
| DFNA5 | -3.047 | -7.565 | 4.835E-12 | 3.890E-11 | 16.599 |
| FKBP5 | -2.988 | -10.435 | 3.528E-19 | 8.458E-18 | 32.863 |
| CD109 | -2.986 | -7.196 | 3.541E-11 | 2.475E-10 | 14.637 |
| CASP1 | -2.955 | -6.388 | 2.367E-09 | 1.267E-08 | 10.509 |
| SULT1E1 | -2.903 | -7.749 | 1.763E-12 | 1.513E-11 | 17.594 |
| FAM174B | 2.779 | 5.557 | 1.353E-07 | 5.493E-07 | 6.555 |
| PDZK1IP1 | -2.752 | -7.028 | 8.611E-11 | 5.667E-10 | 13.743 |
| TNNI2 | -2.750 | -7.896 | 7.842E-13 | 7.133E-12 | 18.393 |
| CAV1 | -2.727 | -5.028 | 1.503E-06 | 5.131E-06 | 4.217 |
| IRX4 | -2.714 | -7.628 | 3.433E-12 | 2.825E-11 | 16.936 |
| KRT80 | 2.706 | 5.268 | 5.131E-07 | 1.895E-06 | 5.259 |
| FOXO1 | -2.649 | -8.921 | 2.408E-15 | 3.188E-14 | 24.113 |
| SNCA | -2.635 | -8.533 | 2.211E-14 | 2.526E-13 | 21.919 |
| TBL1X | 2.565 | 9.676 | 3.043E-17 | 5.434E-16 | 28.442 |
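
The adj.p.val column above holds p-values corrected for multiple testing. The following is a minimal sketch of one such correction, assuming the Benjamini-Hochberg procedure, which is the default `adjust.method` of limma's `topTable`; the source does not state which method was actually used. The function name `benjamini_hochberg` and the p-values are illustrative toy values, not data from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never decrease as the raw p-value increases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values (assumption: purely illustrative, not from the table).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(toy_p)
for p, q in zip(toy_p, adj):
    print(f"p = {p:.3f}  adjusted = {q:.5f}")
```

Because the correction depends on the total number of genes tested, the 98 rows listed here are not enough on their own to reproduce the adj.p.val values in the table.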
Fig. 8 Hierarchical cluster using the 98 invariant expressed genes
Training and test classification accuracies for SVMs, RFs and k-NN algorithms
| | 1 Gene | 6 Genes | 98 Genes |
|---|---|---|---|
| Training accuracy | | | |
| Support vector machines | 98.5% | 100% | 100% |
| Random forest | 97.8% | 99.2% | 100% |
| k-Nearest neighbor | 98.5% | 99.2% | 100% |
| Test accuracy | | | |
| Support vector machines | 86.5% | 96.8% | 97.6% |
| Random forest | 82.3% | 87.4% | 97.4% |
| k-Nearest neighbor | 84.4% | 94.1% | 94.9% |
Fig. 9 Validation and test classification results with SVM, RF and k-NN using the most relevant genes obtained by mRMR
Relationship of the top 6 expressed genes with breast cancer
| Gene symbol | Gene name | Relationship between protein and breast cancer |
|---|---|---|
| SFRP1 | Secreted frizzled-related protein 1 | Inhibition of SFRP1 increases the proliferation, migration and invasion of breast cancer cells. SFRP1 exerted this function by activating Wnt/ |
| GSTM3 | Glutathione S-transferase mu 3 | GSTM3 is suggested as an important modifier of individual susceptibility to developing breast cancer among premenopausal women. |
| SULT1E1 | Sulfotransferase family 1E member 1 | SULT1E1 is an enzyme that catalyzes the sulfation of active 17 |
| MB | Myoglobin | MB plays a functional role in breast cancer progression by promoting the growth of fully oxygenated cells through the control of fatty acid homeostasis and lipogenesis. |
| TRIM29 | Tripartite motif containing 29 | TRIM29 is considered a breast cancer tumor suppressor. Low TRIM29 expression in breast cancer is associated with more aggressive tumor features. One suggested mechanism by which TRIM29 suppresses breast cancer development is suppression of the expression of the oncogenic transcription factor TWIST1. |
| VSTM2L | V-set and transmembrane domain containing 2 like | Although VSTM2L is detected in breast cancer tissues, the current literature reports no relationship between its expression and breast cancer development. |
Fig. 10 Hierarchical cluster over healthy and breast cancer samples using the top 6 genes
Fig. 11 Average expression value boxplots of the six most relevant genes obtained in this study