Ivan V Ozerov, Ksenia V Lezhnina, Evgeny Izumchenko, Artem V Artemov, Sergey Medintsev, Quentin Vanhaelen, Alexander Aliper, Jan Vijg, Andreyan N Osipov, Ivan Labat, Michael D West, Anton Buzdin, Charles R Cantor, Yuri Nikolsky, Nikolay Borisov, Irina Irincheeva, Edward Khokhlovich, David Sidransky, Miguel Luiz Camargo, Alex Zhavoronkov.
Abstract
Signalling pathway activation analysis is a powerful approach for extracting biologically relevant features from large-scale transcriptomic and proteomic data. However, modern pathway-based methods often fail to provide stable pathway signatures of a specific phenotype or reliable disease biomarkers. In the present study, we introduce the in silico Pathway Activation Network Decomposition Analysis (iPANDA) as a scalable, robust method for biomarker identification using gene expression data. The iPANDA method combines precalculated gene coexpression data with gene importance factors based on the degree of differential gene expression and pathway topology decomposition for obtaining pathway activation scores. Using Microarray Analysis Quality Control (MAQC) data sets and pretreatment data on Taxol-based neoadjuvant breast cancer therapy from multiple sources, we demonstrate that iPANDA provides significant noise reduction in transcriptomic data and identifies highly robust sets of biologically relevant pathway signatures. We successfully apply iPANDA for stratifying breast cancer patients according to their sensitivity to neoadjuvant therapy.
Year: 2016 PMID: 27848968 PMCID: PMC5116087 DOI: 10.1038/ncomms13427
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1. The general scheme of the iPANDA calculation pipeline.
Fold changes between the gene expression levels in the samples under investigation and the average expression level of samples within the normal set serve as input data for the iPANDA algorithm. The major steps of the iPANDA algorithm are estimation of statistical weights (1), co-expression-based grouping of genes into modules (2), estimation of topological weights (3) and calculation of iPANDA pathway activation scores (4).
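The input described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `log2_fold_changes`, the use of an arithmetic mean over linear-scale expression values and the log2 transform are all assumptions made for illustration, and the gene names are placeholders.

```python
import math

def log2_fold_changes(sample, normals):
    """Return per-gene log2 fold change of one sample versus the normal set.

    sample:  dict mapping gene -> expression value (assumed linear scale, > 0)
    normals: list of such dicts, one per normal sample
    The arithmetic mean and the log2 transform are illustrative assumptions.
    """
    fold_changes = {}
    for gene, value in sample.items():
        normal_values = [n[gene] for n in normals if gene in n]
        if not normal_values:
            continue  # gene not measured in any normal sample; skip it
        normal_mean = sum(normal_values) / len(normal_values)
        fold_changes[gene] = math.log2(value / normal_mean)
    return fold_changes

# Hypothetical toy data: one tumour sample and two normal samples.
tumour = {"GENE_A": 8.0, "GENE_B": 1.0}
normals = [{"GENE_A": 2.0, "GENE_B": 3.0}, {"GENE_A": 2.0, "GENE_B": 5.0}]
fc = log2_fold_changes(tumour, normals)
```

The resulting dictionary would then be passed to the weighting and module-grouping steps (1) to (4), whose formulas are not given in this record and are therefore not sketched here.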
Figure 2. Sample-wise similarity between data obtained using various profiling platforms.
Pearson sample-wise correlation coefficients between gene expression levels obtained with Affymetrix and Agilent platforms for the same set of samples are shown in blue (only differentially expressed genes, with group t-test P value <0.05, are used). Pearson sample-wise correlations between the corresponding pathway activation values calculated using iPANDA are shown in yellow. Dashed and dotted lines represent, respectively, the median and the upper and lower quartiles of the empirical distribution. Gene expression data were obtained from the MicroArray Quality Control (MAQC) data set (GEO identifier GSE5350). Application of iPANDA leads to higher correlation between the data obtained using different experimental platforms for the same samples.
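A sample-wise Pearson comparison of this kind might be sketched as follows. The function names, the nested-dictionary layout and the toy values are assumptions for illustration. The same function applies whether the features are gene expression levels or pathway activation values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)

def samplewise_correlations(platform_a, platform_b):
    """Correlate each shared sample across two platforms.

    platform_a, platform_b: dict sample -> dict feature -> value
    Only features measured on both platforms are compared.
    """
    result = {}
    for sample in platform_a.keys() & platform_b.keys():
        shared = sorted(platform_a[sample].keys() & platform_b[sample].keys())
        xs = [platform_a[sample][f] for f in shared]
        ys = [platform_b[sample][f] for f in shared]
        result[sample] = pearson(xs, ys)
    return result

# Hypothetical toy profiles for two samples measured on two platforms.
affymetrix = {"S1": {"G1": 1.0, "G2": 2.0, "G3": 3.0},
              "S2": {"G1": 3.0, "G2": 1.0, "G3": 2.0}}
agilent = {"S1": {"G1": 2.0, "G2": 4.0, "G3": 6.0},
           "S2": {"G1": 1.0, "G2": 3.0, "G3": 2.0}}
corr = samplewise_correlations(affymetrix, agilent)
```

In this toy input, sample S1 agrees perfectly between platforms and sample S2 is perfectly anti-correlated. The figure summarises the distribution of such per-sample coefficients.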
Figure 3. Receiver operating characteristic AUC values for the 30 pathway markers rated highest by AUC.
Pathway markers of responders/non-responders to paclitaxel for ERN HER2P (left) and ERN HER2N (right) breast cancer treatment were obtained using iPANDA. Pathways up- and downregulated in the responder group compared with the non-responder group are shown in red and blue, respectively. The saturation of the colour denotes the corresponding AUC value. The same signalling pathways are found to be markers of responders/non-responders to paclitaxel treatment for four (ERN HER2P) or five (ERN HER2N) independent data sets obtained from GEO. Nineteen and eight pathway markers for ERN HER2P and ERN HER2N breast cancer, respectively, demonstrate AUC values higher than 0.7 for all data sets examined.
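The selection criterion in this caption (AUC above 0.7 in every data set) can be illustrated with a small sketch. The function names, the pairwise-comparison AUC estimator and all values below are assumptions for illustration, not the authors' code. A pathway that is downregulated in responders would score below 0.5 under this raw definition, so a real analysis would have to account for direction.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def robust_markers(auc_by_dataset, threshold=0.7):
    """Return pathways whose AUC exceeds the threshold in every data set.

    auc_by_dataset: dict data set name -> dict pathway name -> AUC
    """
    tables = list(auc_by_dataset.values())
    shared = set.intersection(*(set(t) for t in tables))
    return sorted(p for p in shared if all(t[p] > threshold for t in tables))

# Hypothetical activation scores of one pathway in responders and non-responders.
auc = roc_auc([0.9, 0.8, 0.4], [0.3, 0.5, 0.2])

# Hypothetical per-data-set AUC tables; data set and pathway names are placeholders.
markers = robust_markers({
    "dataset_1": {"pathway_X": 0.80, "pathway_Y": 0.75},
    "dataset_2": {"pathway_X": 0.72, "pathway_Y": 0.65},
})
```

Here `pathway_Y` is dropped because it falls below the threshold in one data set, which mirrors the "for all data sets examined" condition.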
Figure 4. Common marker pathway (CMP) index for responders/non-responders to paclitaxel treatment of ERN HER2P and ERN HER2N breast cancer types.
The index is calculated for four independent data sets obtained from GEO (GSE20194, GSE20271, GSE32646 and GSE50948) for each cancer type. The index reflects the robustness of the pathway marker lists across data sets. Applying gene modules or topological coefficients alone did not improve the robustness of the algorithm for estimation of pathway activation; however, applying them together resulted in a significant improvement.
Training and validation data sets used in paclitaxel neoadjuvant therapy sensitivity prediction experiment in breast cancer patients.
| Breast cancer type | Training data sets | Training samples | Validation data sets | Validation samples |
| --- | --- | --- | --- | --- |
| ERN HER2P | GSE20194, GSE20271 | 38 | GSE32646, GSE50948 | 57 |
| ERN HER2N | GSE20194, GSE20271 | 108 | GSE32646, GSE41998, GSE50948 | 115 |
| All cancer types | GSE20194, GSE20271 | 285 | GSE22513, GSE32646, GSE41998, GSE50948 | 299 |
Figure 5. Performance of random forest models for paclitaxel sensitivity prediction in breast cancer patients.
Models were built with respect to three distinct end points: sensitivity of the ERN HER2P cancer type, of the ERN HER2N type and of all breast cancer types mixed. The models were trained using iPANDA, SPIA, GSEA, ssGSEA, PLAGE and DART pathway activation (enrichment) scores and gene-level data, including gene expression values for all genes (logGE), fold changes of tumour samples relative to the mean of paired normal samples for all genes (logFC), gene expression for differential genes only (logDGE; genes are considered differentially expressed if the t-test P value was <0.05 between tumour and normal samples) and fold changes for differential genes only (logDFC). MCC, specificity and sensitivity performance metrics are shown for each model.
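The three reported metrics follow from a binary confusion matrix, as in the sketch below. It uses the standard definitions of sensitivity, specificity and the Matthews correlation coefficient (MCC). The function name, the label encoding (1 = responder, 0 = non-responder) and the toy predictions are assumptions for illustration.

```python
import math

def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity and MCC for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}

# Hypothetical predictions for eight patients (1 = responder, 0 = non-responder).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

MCC is useful alongside sensitivity and specificity because responder and non-responder groups are often unbalanced, and MCC takes all four confusion-matrix cells into account.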