Wenqian Zhang1, Ying Yu2, Falk Hertwig3,4, Jean Thierry-Mieg5, Wenwei Zhang1, Danielle Thierry-Mieg5, Jian Wang6, Cesare Furlanello7, Viswanath Devanarayan8, Jie Cheng9, Youping Deng10, Barbara Hero3, Huixiao Hong11, Meiwen Jia2, Li Li12, Simon M Lin13, Yuri Nikolsky14, André Oberthuer3, Tao Qing2, Zhenqiang Su11, Ruth Volland3, Charles Wang15, May D Wang16, Junmei Ai10, Davide Albanese17, Shahab Asgharzadeh18, Smadar Avigad19, Wenjun Bao12, Marina Bessarabova14, Murray H Brilliant20, Benedikt Brors21, Marco Chierici7, Tzu-Ming Chu12, Jibin Zhang1, Richard G Grundy22, Min Max He13, Scott Hebbring20, Howard L Kaufman10, Samir Lababidi23, Lee J Lancashire14, Yan Li10, Xin X Lu24, Heng Luo11,25, Xiwen Ma26, Baitang Ning11, Rosa Noguera27, Martin Peifer4,28, John H Phan16, Frederik Roels3,4, Carolina Rosswog3, Susan Shao12, Jie Shen11, Jessica Theissen3, Gian Paolo Tonini29, Jo Vandesompele30, Po-Yen Wu31, Wenzhong Xiao32, Joshua Xu11, Weihong Xu33, Jiekun Xuan11, Yong Yang6, Zhan Ye13, Zirui Dong1, Ke K Zhang34, Ye Yin1, Chen Zhao2, Yuanting Zheng2, Russell D Wolfinger12, Tieliu Shi35, Linda H Malkas36, Frank Berthold3,4, Jun Wang1,37,38,39, Weida Tong11, Leming Shi40,41, Zhiyu Peng42,43, Matthias Fischer44,45.
Abstract
BACKGROUND: Gene expression profiling is being widely applied in cancer research to identify biomarkers for clinical endpoint prediction. Since RNA-seq provides a powerful tool for transcriptome-based applications beyond the limitations of microarrays, we sought to systematically evaluate the performance of RNA-seq-based and microarray-based classifiers in this MAQC-III/SEQC study for clinical endpoint prediction using neuroblastoma as a model.
Year: 2015 PMID: 26109056 PMCID: PMC4506430 DOI: 10.1186/s13059-015-0694-1
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Clinical characteristics of neuroblastoma patients
| | Number | Percent of total |
|---|---|---|
| MYCN status | ||
| Normal | 401 | 80.5 % |
| Amplified | 92 | 18.5 % |
| N.A. | 5 | 1.0 % |
| INSS stage | ||
| 1 | 121 | 24.3 % |
| 2 | 78 | 15.7 % |
| 3 | 63 | 12.7 % |
| 4 | 183 | 36.7 % |
| 4S | 53 | 10.6 % |
| Age at diagnosis | ||
| <18 months | 300 | 60.2 % |
| >18 months | 198 | 39.8 % |
| Sex | ||
| Male | 278 | 55.8 % |
| Female | 205 | 41.2 % |
| N.A. | 15 | 3.0 % |
| High-risk patients | 176 | 35.3 % |
N.A., not available
Fig. 1 Characteristics of the neuroblastoma transcriptome according to RNA-seq data using the Magic-AceView pipeline. a Percentage of reads mapped to distinct targets. b Number of genes, transcripts, and exon-junctions expressed in the entire neuroblastoma cohort according to their annotation by AceView. c Absolute numbers and overlap of differentially expressed genes (DEGs) identified by RNA-seq (red) and microarrays (blue) in four disease subgroups (see main text)
Definition of clinical endpoints analyzed in this study
| Cohort | Endpoint (bin 1/0) | Training set: # samples | Training set: 1 | Training set: 0 | Validation set: # samples | Validation set: 1 | Validation set: 0 |
|---|---|---|---|---|---|---|---|
| All patients (498) | SEX (Female/Male) | 249 | 103 | 146 | 249 | 108 | 141 |
| All patients (498) | EFS ALL (Event yes/no) | 249 | 89 | 160 | 249 | 94 | 155 |
| All patients (498) | OS ALL (Death yes/no) | 249 | 51 | 198 | 249 | 54 | 195 |
| Class labeled patients (272) | CLASS LABEL (Unfavorable/Favorable) | 136 | 45 | 91 | 136 | 46 | 90 |
| High-risk patients (176) | EFS HR (Event yes/no) | 86 | 55 | 31 | 90 | 65 | 25 |
| High-risk patients (176) | OS HR (Death yes/no) | 86 | 43 | 43 | 90 | 49 | 41 |
EFS, event-free survival; HR, high risk; OS, overall survival
Fig. 2 Performance of RNA-seq- and microarray-based models in predicting clinical endpoints in the validation cohorts. a Schematic overview of gene expression profiles generated by RNA-seq (n = 9 per sample) and microarray (n = 1 per sample). CL, Cufflinks; MAV, Magic-AceView; TAV, TopHat-AceView; TUC, TopHat-UCSC. b Distribution of MCC values of all models for each endpoint according to the technical platform (MA, microarray). Boxes indicate the 25th and 75th percentiles, and whiskers indicate the 5th and 95th percentiles; (*), P < 0.05 (two-sided t-test). c, d Model performance in internal validation compared with external validation, in terms of MCC, based on (c) microarray and (d) RNA-seq expression data
Fig. 3 Analysis of factors potentially affecting the prediction performance of RNA-seq-based models. a Distribution of MCC values of all models for each endpoint according to RNA-seq data processing pipelines (MAV, Magic-AceView; TAV, TopHat-AceView; TUC, TopHat-UCSC). b Distribution of MCC values of all models for each endpoint according to feature levels, that is, gene, transcript (TS), and exon-junction (Jct) levels
Fig. 4 a Contribution of different factors to the variability of prediction results as assessed by variance component analysis. (*), P < 0.05; (**), P < 0.01. The factors platform, RNA-seq pipeline, feature level, analysis team, classification method, and model size were analyzed both independently of the endpoint (white box) and taking a potential endpoint dependence into account (gray box). b Best linear unbiased predictor (BLUP) estimates for log10(model size), the single factor contributing significantly to the prediction variability independent of the endpoint. Note that BLUPs are centered around zero and effectively average over all other effects. The BLUPs for log10(model size) indicate that models with 100 to 1,000 features perform better than those with fewer or more features
Fig. 5 Correlation of prediction performance with the feature composition of prediction models. MCC values of MAV and TAV models were plotted against the fraction of RefSeq-annotated genes (a), the fraction of protein-coding genes (b), and the fraction of spliced genes, that is, genes or transcripts consisting of at least two exons (c), in the model
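
The figure legends above report model performance as MCC (Matthews correlation coefficient) values. As a reminder of what that number measures, here is a minimal sketch that computes MCC for a binary endpoint from confusion-matrix counts. The function name `mcc`, the convention of returning 0 when a marginal count is zero, and the prediction counts in the usage example are illustrative assumptions, not code or results from the study; only the class sizes (49 events, 41 non-events) are taken from the OS HR validation set in the endpoint table.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/fn: class-1 samples predicted as 1/0; tn/fp: class-0 samples
    predicted as 0/1. Ranges from -1 (all wrong) to +1 (all correct).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is undefined when any marginal is zero
    # (e.g. a model that predicts a single class); report 0 in that case.
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical predictions for a validation set with 49 events (class 1)
# and 41 non-events (class 0); these counts are made up for illustration.
example = mcc(tp=30, tn=25, fp=16, fn=19)
print(f"MCC = {example:.3f}")
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, so a model that always predicts the majority class scores 0 rather than being rewarded for class imbalance, which matters for endpoints such as OS ALL where the two classes differ markedly in size.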