Florian Rohart, Aida Eslami, Nicholas Matigian, Stéphanie Bougeard, Kim-Anh Lê Cao.
Abstract
BACKGROUND: Molecular signatures identified from high-throughput transcriptomic studies often have poor reliability and fail to reproduce across studies. One solution is to combine independent studies into a single integrative analysis, which also increases sample size. However, the different protocols and technological platforms used across transcriptomic studies produce unwanted systematic variation that strongly confounds the results of the integrative analysis. When studies aim to discriminate an outcome of interest, the common approach is a sequential two-step procedure: techniques that remove unwanted systematic variation are applied before classification methods.
Keywords: Algorithm; Classification; Integration; Multivariate; Partial-least-square; Transcriptome analysis
Year: 2017 PMID: 28241739 PMCID: PMC5327533 DOI: 10.1186/s12859-017-1553-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Stem cells experimental design
| Experiment | Platform | Fib | hESC | hiPSC |
|---|---|---|---|---|
| Bock | Affymetrix HT-HG-U133A | 6 | 20 | 12 |
| Briggs | Illumina HumanHT-12 V4 | 18 | 3 | 30 |
| Chung | Affymetrix HuGene-1.0-ST V1 | 3 | 8 | 10 |
| Ebert | Affymetrix HG-U133 Plus2 | 2 | 5 | 3 |
| Guenther | Affymetrix HG-U133 Plus2 | 2 | 17 | 20 |
| Maherali | Affymetrix HG-U133 Plus2 | 3 | 3 | 15 |
| Marchetto | Affymetrix HuGene-1.0-ST V1 | 6 | 3 | 12 |
| Takahashi | Agilent SurePrint G3 GE 8x60K | 3 | 3 | 3 |
| Total training set | 5 platforms | 43 | 62 | 105 |
| Andrade | Affymetrix HuGene-1.0-ST V1 | 3 | 6 | 15 |
| Hu | Affymetrix HG-U133 Plus2 | 1 | 5 | 12 |
| Kim | Affymetrix HG-U133 Plus2 | 1 | 1 | 3 |
| Loewer | Affymetrix HG-U133 Plus2 | 4 | 2 | 7 |
| Si-Tayeb | Affymetrix HG-U133 Plus2 | 3 | 6 | 6 |
| Vitale | Illumina HumanHT-12 V4 | 8 | 3 | 18 |
| Yu | Affymetrix HG-U133 Plus2 | 2 | 10 | 16 |
| Total test set | 3 platforms | 22 | 33 | 77 |
A total of 15 studies were analysed, covering three human cell types, human Fibroblasts (Fib), human Embryonic Stem Cells (hESC) and human induced Pluripotent Stem Cells (hiPSC), across five different microarray platforms. Eight studies from five microarray platforms were used as a training set [57–64] and seven independent studies from three of the five platforms were used as a test set [65–71].
Experimental design of four breast cancer cohorts including four cancer subtypes: Basal, HER2, Luminal A (LumA) and Luminal B (LumB)
| Experiment | Platform | Basal | HER2 | LumA | LumB |
|---|---|---|---|---|---|
| METABRIC Discovery | Illumina HT-12 v3 | 118 | 87 | 466 | 268 |
| METABRIC Validation | Illumina HT-12 v3 | 213 | 153 | 255 | 224 |
| TCGA RNA-seq | Illumina HiSeq 2000 | 188 | 80 | 549 | 213 |
| Total training set | 2 platforms | 519 | 320 | 1270 | 705 |
| TCGA microarray | Agilent custom 244K | 57 | 31 | 99 | 67 |
| Total test set | 1 platform | 57 | 31 | 99 | 67 |
Fig. 1 Stem cell study. a PCA on the concatenated data: study variation is greater than cell type variation. b PLSDA on the concatenated data clustered Fibroblasts only. c MINT sample plot: each cell type is well clustered. d MINT performance: BER and classification accuracy for each cell type and each study
Fig. 2 Classification accuracy for both training and test set for the stem cells and breast cancer studies (excluding PAM50 genes). The classification Balanced Error Rates (BER) are reported for all sixteen methods compared with MINT (in black)
Fig. 3 MINT study-specific sample plots showing the projection of samples from a METABRIC Discovery, b METABRIC Validation and c TCGA RNA-seq experiments, in the same subspace spanned by the first two MINT components. d The same subspace is also used to plot the overall (integrated) data. e Balanced Error Rate and classification accuracy for each study and breast cancer subtype from the MINT analysis
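
The figure captions report the Balanced Error Rate (BER) alongside classification accuracy. The sketch below assumes the usual definition of BER as the mean of the per-class error rates, which gives each class equal weight however many samples it has. That matters for designs like the ones tabulated above, where class sizes are uneven (for example 43 Fib versus 105 hiPSC in the stem cell training set). The function name `balanced_error_rate` and the toy labels are illustrative only and are not taken from the paper or its software.

```python
from collections import defaultdict


def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class error rates (assumed standard BER definition).

    y_true and y_pred are equal-length sequences of class labels.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    totals = defaultdict(int)  # number of samples per true class
    errors = defaultdict(int)  # number of misclassified samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth != pred:
            errors[truth] += 1

    # Average the per-class error rates so that every class counts equally.
    return sum(errors[c] / totals[c] for c in totals) / len(totals)


# Toy example (made-up labels): one of four hESC samples is misclassified.
y_true = ["Fib", "Fib", "hESC", "hESC", "hESC", "hESC"]
y_pred = ["Fib", "Fib", "hESC", "hESC", "hESC", "Fib"]
print(balanced_error_rate(y_true, y_pred))  # 0.125
```

In the toy example the overall error rate is 1/6, while the BER is (0 + 1/4) / 2 = 0.125. The two measures diverge as soon as class sizes differ.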