Nisha Puthiyedth, Carlos Riveros, Regina Berretta, Pablo Moscato.
Abstract
BACKGROUND: The joint study of multiple datasets has become a common technique for increasing statistical power in detecting biomarkers obtained from smaller studies. The approach generally followed is based on the fact that as the total number of samples increases, we expect to have greater power to detect associations of interest. This methodology has been applied to genome-wide association and transcriptomic studies due to the availability of datasets in the public domain. While this approach is well established in biostatistics, the introduction of new combinatorial optimization models to address this issue has not been explored in depth. In this study, we introduce a new model for the integration of multiple datasets and we show its application in transcriptomics.
Year: 2015 PMID: 26106884 PMCID: PMC4480358 DOI: 10.1371/journal.pone.0127702
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of datasets used in this study.
| Name | Plat | Series | NS | Norm | PT | Met | Probes | EF |
|---|---|---|---|---|---|---|---|---|
| Singh [ | Affymetrix [HG-U95Av2] | N/A | 102 | 50 | 52 | 0 | 12558 | 1519 |
| Welsh [ | Affymetrix [HG-U95Av2] | N/A | 55 | 9 | 25 | 21 | 12560 | 2429 |
| Uma [ | Affymetrix [HG-U95B] | E-GEOD-6919 | 80 | 17 | 63 | 0 | 37691 | 3484 |
| L-2695 [ | SHBB | GSE3933 | 26 | 9 | 13 | 4 | 44161 | 4288 |
| L-3044 [ | SHCQ | GSE3933 | 41 | 16 | 23 | 2 | 43009 | 4082 |
| L-3289 [ | SHBW | GSE3933 | 45 | 16 | 26 | 3 | 43009 | 4953 |
Name is the name assigned to the study throughout this paper. Plat is the platform of each dataset. Series is the Gene Expression Omnibus Series identifier for the dataset. NS is the original number of samples in the study, of which Norm is the number of healthy tissue samples, PT is the number of primary tumour samples and Met is the number of metastasis samples. Probes is the number of probes present in each dataset, and EF is the number of probes remaining after entropy filtering.
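The caption implies that NS should equal Norm + PT + Met for every dataset. A minimal Python sketch of that consistency check follows; the values are copied from the table above, while the function and variable names are illustrative and not part of the paper.

```python
# Rows copied from the dataset summary table: (name, NS, Norm, PT, Met).
DATASETS = [
    ("Singh", 102, 50, 52, 0),
    ("Welsh", 55, 9, 25, 21),
    ("Uma", 80, 17, 63, 0),
    ("L-2695", 26, 9, 13, 4),
    ("L-3044", 41, 16, 23, 2),
    ("L-3289", 45, 16, 26, 3),
]


def sample_count_mismatches(rows):
    """Return the names of datasets where NS != Norm + PT + Met."""
    return [name for name, ns, norm, pt, met in rows if ns != norm + pt + met]


if __name__ == "__main__":
    # An empty list means every row of the table is internally consistent.
    print(sample_count_mismatches(DATASETS))
```

With the values as listed, the sample counts of all six datasets add up.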
The results of the numerical solution of the (α,β)-k-Feature Set problem on each of the six individual datasets.
| Dataset | Feat.No | After EF | α | β | k |
|---|---|---|---|---|---|
| Singh | 12558 | 1519 | 215 | 329 | 754 |
| Welsh | 12560 | 2429 | 1188 | 1068 | 1768 |
| Uma | 37691 | 3484 | 881 | 1079 | 1857 |
| L-2695 | 44161 | 4288 | 2266 | 2421 | 3533 |
| L-3044 | 43009 | 4028 | 966 | 862 | 1800 |
| L-3289 | 43009 | 4953 | 1397 | 1216 | 2696 |
Dataset is the short name used in this paper for the dataset. Feat.No is the initial number of features (probes) present in the dataset, After EF is the number of features after applying entropy filtering, α and β are the values of the parameters α and β for any feasible solution, and k is the number of probes in the resulting solution to the individual (α,β)-k-Feature Selection problem for the dataset. For method details refer to Materials and Methods.
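This excerpt does not define the (α,β)-k-Feature Set problem itself. As a rough illustration only, the sketch below assumes the formulation commonly given for it: on binary (discretised) features, a set of k selected features is feasible if every pair of samples from different classes differs on at least α selected features, and every pair of samples from the same class agrees on at least β selected features. That definition, the function name and the toy data are assumptions, not taken from the paper.

```python
from itertools import combinations


def is_alpha_beta_k_feature_set(samples, labels, selected, alpha, beta):
    """Check feasibility of `selected` (feature indices) under the assumed definition.

    samples: list of equal-length tuples of 0/1 feature values
    labels:  class label of each sample
    k is simply len(selected).
    """
    for i, j in combinations(range(len(samples)), 2):
        differing = sum(samples[i][f] != samples[j][f] for f in selected)
        agreeing = len(selected) - differing
        if labels[i] != labels[j]:
            # Samples from different classes need at least alpha discriminating features.
            if differing < alpha:
                return False
        elif agreeing < beta:
            # Samples from the same class need at least beta consistent features.
            return False
    return True


# Toy data (hypothetical): four samples, three binary features.
TOY_SAMPLES = [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0)]
TOY_LABELS = ["tumour", "tumour", "normal", "normal"]

if __name__ == "__main__":
    print(is_alpha_beta_k_feature_set(TOY_SAMPLES, TOY_LABELS, [0, 2], alpha=2, beta=2))
    print(is_alpha_beta_k_feature_set(TOY_SAMPLES, TOY_LABELS, [1], alpha=1, beta=1))
```

The sketch only verifies a candidate solution; finding a feasible set of minimum size k is the combinatorial optimisation step and is not attempted here.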
List of common genes among all the individual dataset results from Table 2.
| Gene Symbol | Gene Name | Reference |
|---|---|---|
| EEF2 | Eukaryotic Translation Elongation Factor 2 | [ |
| SPG20 | Spastic Paraplegia 20 | No associated reference |
| ERG | Erythroblastosis Virus E26 Oncogene Homolog | [ |
| AMACR | Alpha-Methylacyl-CoA Racemase | [ |
| SOX4 | SRY (Sex determining Region Y)-box 4 | [ |
| APOC1 | Apolipoprotein C-I | [ |
| GUCY1A3 | Guanylate Cyclase 1, soluble, alpha 3 | [ |
Gene Symbol is the official gene symbol. Gene Name is the expanded gene name. Reference is the publication that reports the relation of each gene to prostate cancer.
t-test results on the individual datasets.
| Dataset | Feat.No | Signature size |
|---|---|---|
| Singh | 1519 | 616 |
| Welsh | 2429 | 717 |
| Uma | 3484 | 690 |
| L-2695 | 4288 | 286 |
| L-3044 | 4028 | 654 |
| L-3289 | 4953 | 647 |
Dataset is the short name used in this paper for the dataset. Feat.No is the number of features (probes) present in the dataset before applying the t-test, and Signature size is the number of genes in the resulting solution for each dataset. For method details refer to Section 2.
Result of Coloured (α,β)-k-Feature Set selection methodology.
| No of Datasets | No of Combined Probes | No of Genes |
|---|---|---|
| Four or more | 2272 | 327 |
| Five or more | 1806 | 186 |
| Six | 792 | 120 |
No of Datasets is the number of datasets that a feature is required to cover. No of Combined Probes is the number of features resulting from the Coloured (α,β)-k-Feature Set selection methodology, and No of Genes is the number of genes corresponding to those combined probes.
Fig 1. Heatmap of the genes resulting from Coloured (α,β)-k-Feature Selection that cover five or more datasets.
The heatmap contains 186 up- and down-regulated genes (columns). The genes are ordered using a memetic algorithm introduced by Moscato et al. in [49]. The greenish-blue blocks represent the absence of gene values in particular datasets. The first colour bar on the right indicates Primary Tumour (blue) and Normal [13] samples. The second colour bar shows each sample group in a different colour: L-2695 (blue), L-3044 (red), L-3289 (orange), Welsh (grey), Uma (cyan) and Singh (dark grey).
Fig 2. Heatmap of the genes resulting from Coloured (α,β)-k-Feature Selection that cover six datasets.
There are 120 up- and down-regulated genes (columns) that are differentially expressed between the normal and tumour classes. The two colour bars on the right represent the ordering of samples and the sample groups, respectively, as explained in Fig 1.
Overlapping genes in t-test, Coloured (α,β)-k and (α,β)-k-Feature Selection.
| Number of Datasets | t-test | (α,β)-k | Coloured |
|---|---|---|---|
| Six | 4 | 7 | 120 |
| Five or more | 22 | 57 | 327 |
| Four or more | 36 | 139 | 623 |
Number of Datasets is the number of datasets considered when computing the overlap. t-test gives the number of genes that overlap among the t-test results for the considered datasets, and (α,β)-k gives the number of genes that overlap among the individual (α,β)-k-Feature Selection results in each case. Coloured gives the number of common genes in the result of Coloured (α,β)-k-Feature Selection for the considered datasets. For method details refer to Section 2.
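The "Six", "Five or more" and "Four or more" rows amount to counting, for each gene, how many per-dataset signatures contain it. A minimal Python sketch of that counting is given below; the function name and the toy signatures are hypothetical and do not reproduce the paper's data.

```python
from collections import Counter


def genes_covering(signatures, min_datasets):
    """Return the genes that appear in at least `min_datasets` signatures.

    signatures: dict mapping a dataset name to an iterable of gene symbols.
    """
    # set() ensures a gene is counted once per dataset even if listed twice.
    counts = Counter(gene for genes in signatures.values() for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= min_datasets}


# Toy signatures (hypothetical gene identifiers, three datasets).
TOY_SIGNATURES = {
    "A": ["g1", "g2", "g3"],
    "B": ["g1", "g2"],
    "C": ["g1", "g4"],
}

if __name__ == "__main__":
    print(sorted(genes_covering(TOY_SIGNATURES, 3)))  # genes present in all three
    print(sorted(genes_covering(TOY_SIGNATURES, 2)))  # genes present in two or more
```

Because the thresholds are nested, the set for "four or more" always contains the set for "five or more", which in turn contains the set for "six".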
Comparison of Coloured (α,β)-k-Feature Set problem result and RankProd result.
| Dataset | RankProd: No of genes as input | RankProd: pfp cut-off | RankProd: No of resulted genes | CABK: Six datasets (120) | CABK: Five datasets (327) | CABK: Four datasets (623) |
|---|---|---|---|---|---|---|
| Combined dataset | 6929 | 0.05 | 1883 | 80 | 169 | 260 |
| Combined dataset | 6929 | 0.01 | 1484 | 58 | 140 | 214 |
RankProd is the result of RankProd for the combined dataset with pfp (percentage of false positive likelihood) cut-offs of 0.05 and 0.01. No of CABK resulted genes is the number of genes resulting from the Coloured (α,β)-k-Feature Set problem that cover six, five or more, and four or more datasets.
Result of sensitivity analysis.
| | Case a: Exp-1 (1 gene) | Case a: Exp-2 (2 genes) | Case a: Exp-3 (5 genes) | Case b: Exp-4 (1 gene / 1DS) | Case b: Exp-5 (1 gene / 2DS) |
|---|---|---|---|---|---|
| Average Signature Length | 3203.1 (28.68) | 3201.4 (28.21) | 3204.9 (9.48) | 3190.2 (0.45) | 3190.8 (1.30) |
| Average % Overlap with Original | 97.42 (3.31) | 97.62 (3.03) | 98.04 (0.51) | 99.41 (0.52) | 99.15 (0.63) |
| Average Number of New Features | 46.1 (11.47) | 50.5 (17.67) | 77.3 (17.97) | 18.6 (16.80) | 27.6 (20.98) |
| Average Cover of New Features | 3.6 | 3.17 | 3.4 | 1.47 | 1.4 |
| Average Signature length variation | 0.41% | 0.36% | 0.47% | 0.01% | 0.03% |
Case a is the result of the sensitivity analysis after removing one gene (Exp-1), two genes (Exp-2) and five genes (Exp-3) from the combined dataset. Case b gives the result of the sensitivity analysis after removing one gene from one (Exp-4) and two (Exp-5) individual datasets. For more details refer to Section 2. Values in parentheses are the standard deviations over the 10 repetitions (Case a) and 5 repetitions (Case b).
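The "Average Signature length variation" row can be read as the relative change of the average signature length with respect to the original signature. The original length is not stated in this excerpt; the Python sketch below assumes a baseline of 3190 probes, a value inferred only because it reproduces all five reported percentages, and uses illustrative names throughout.

```python
# Assumption: original signature length, inferred from the table, not stated in the text.
ASSUMED_BASELINE = 3190

# Average signature lengths and reported variations copied from the table above.
AVERAGE_LENGTHS = {"Exp-1": 3203.1, "Exp-2": 3201.4, "Exp-3": 3204.9,
                   "Exp-4": 3190.2, "Exp-5": 3190.8}
REPORTED_VARIATION = {"Exp-1": 0.41, "Exp-2": 0.36, "Exp-3": 0.47,
                      "Exp-4": 0.01, "Exp-5": 0.03}


def variation_percent(average_length, baseline=ASSUMED_BASELINE):
    """Relative change of the signature length, in percent of the baseline."""
    return 100.0 * abs(average_length - baseline) / baseline


if __name__ == "__main__":
    for exp, length in AVERAGE_LENGTHS.items():
        print(exp, round(variation_percent(length), 2), REPORTED_VARIATION[exp])
```

Under that assumed baseline, each computed value rounds to the percentage reported in the table.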
The top 14 pathways resulting from the pathway analysis.
| Pathway name | Pathway Classification | P-value | Reference |
|---|---|---|---|
| Integrin signalling pathway | Cell communication | 1.03E-08 | [ |
| Smooth Muscle Contraction | Organismal Systems; Circulatory system | 5.98E-08 | [ |
| Oxytocin signalling pathway | Organismal Systems; Endocrine system | 1.23E-08 | [ |
| Collagen biosynthesis and modifying enzymes | Metabolism; Amino acid metabolism | 1.13E-07 | [ |
| Axon guidance | Development | 1.34E-06 | [ |
| Gap junction trafficking | Cell communication | 3.12E-06 | [ |
| Protein digestion and absorption | Organismal Systems; Digestive system | 3.6E-06 | [ |
| Ras activation | Regulation of translation and transcription | 3.46E-05 | [ |
| regulation of pgc-1a | Cell motility | 3.61E-05 | [ |
| Assembly of collagen fibrils and other multimeric structures | Metabolism | 3.68E-05 | [ |
| CREB phosphorylation | Metabolism; Energy metabolism | 6.31E-05 | [ |
| Syndecan-1-mediated signalling events | Genetic Information Processing | 6.3E-05 | [ |
| NCAM1 interactions | Signal Transduction | 6.72E-05 | [ |
| regulators of bone mineralization | Metabolism | 6.7E-05 | [ |
Pathway name is the name of the pathway. Pathway Classification is the class of each pathway. P-value is the respective p-value for each pathway. Reference lists the papers that show the relation of each pathway to prostate cancer.