Literature DB >> 27587658

Complementary feature selection from alternative splicing events and gene expression for phenotype prediction.

Charles J Labuzzetta1, Margaret L Antonio2, Patricia M Watson3, Robert C Wilson3, Lauren A Laboissonniere4, Jeffrey M Trimarchi4, Baris Genc5, P Hande Ozdinler5, Dennis K Watson3, Paul E Anderson6.   

Abstract

MOTIVATION: A central task of bioinformatics is to develop sensitive and specific means of providing medical prognoses from biomarker patterns. Common methods to predict phenotypes in RNA-Seq datasets utilize machine learning algorithms trained via gene expression. Isoforms, however, generated from alternative splicing, may provide a novel and complementary set of transcripts for phenotype prediction. In contrast to gene expression, the number of isoforms increases significantly due to numerous alternative splicing patterns, resulting in a prioritization problem for many machine learning algorithms. This study identifies the empirically optimal methods of transcript quantification, feature engineering and filtering steps using phenotype prediction accuracy as a metric. At the same time, the complementary nature of gene and isoform data is analyzed and the feasibility of identifying isoforms as biomarker candidates is examined.
RESULTS: Isoform features are complementary to gene features, providing non-redundant information and enhanced predictive power when prioritized and filtered. A univariate filtering algorithm, which selects up to the N highest ranking features for phenotype prediction is described and evaluated in this study. An empirical comparison of pipelines for isoform quantification is reported by performing cross-validation prediction tests with datasets from human non-small cell lung cancer (NSCLC) patients, human patients with chronic obstructive pulmonary disease (COPD) and amyotrophic lateral sclerosis (ALS) transgenic mice, each including samples of diseased and non-diseased phenotypes.
AVAILABILITY AND IMPLEMENTATION: https://github.com/clabuzze/Phenotype-Prediction-Pipeline.git CONTACT: clabuzze@iastate.edu, antoniom@bc.edu, watsondk@musc.edu, andersonpe2@cofc.edu.
© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

Entities:  

Mesh:

Year:  2016        PMID: 27587658      PMCID: PMC6276944          DOI: 10.1093/bioinformatics/btw430

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Comprehensive analysis of high-throughput sequencing data remains a challenging task due to the inherent complexities of genetic transcript analysis from next-generation sequencing data (Kanitz ). An especially difficult aspect is the accurate estimation of gene and isoform transcript expression in RNA-Seq data (Liu ). Several methods have been developed which claim to approach the problem with an empirically superior algorithm; however, an objective analysis using non-simulated data is often difficult (Leng ; Trapnell ; Wang and Cairns, 2014). In RNA-Seq, ‘reads’ represent sequenced transcript fragments. The total number of reads aligning to each transcript is quantified as counts. It is difficult to verify isoform expression because correlation with phenotypes is rarely annotated in genomic databases and comprehensive validation using PCR is unrealistic. An optimal pipeline for isoform expression quantification is necessary in order to apply the full potential of high-throughput sequencing data to biomedical analysis (Kanitz ). However, distinct algorithm design and the loss of biological validation make analyses of isoform expression algorithms difficult (Leng ; Trapnell ; Wang and Cairns, 2014). Processing samples using RNA-Seq technology captures the expression of genes and isoforms, generated by alternative splicing (Li and Dewey, 2011; Trapnell ; Wang and Cairns, 2014). Previous methods of comparing transcript quantification and differential expression techniques have relied on the analysis of false-discovery rates and commonly identified alternative splicing patterns (Liu ). An objective method to compare algorithms that quantify transcript expression is to measure the ability of machine learning techniques to predict the phenotype of biological samples processed with each algorithm (Anderson ). Prediction accuracy can be used as a metric to determine which tool most reliably quantifies expression. Two of the most utilized transcript assembly and differential expression pipelines include Tophat/Cufflinks and RSEM/EBSeq (Leng ; Trapnell ). These approaches utilize varying statistical methods, each claiming to optimally address the challenge. SeqGSEA is another differential expression and alternative splicing platform that has yielded competitive results with respect to the two well-established pipelines (Wang and Cairns, 2014). The state-of-the-art nature of these tools provides a convincing argument to focus a comprehensive analysis on these three methods. These tools quantify expression for a massive number of transcripts. Feature selection and filtering can reduce the massive number of gene and isoform features to a subset that efficiently represents the original data. Machine learning algorithms such as Sparse Partial Least Squares (Chun and Keleş, 2010), Elastic Net (Zou and Hastie, 2005) and Random Forest (Liaw and Wiener, 2002) may be overwhelmed by noisy datasets due to over-fitting. Therefore, a method to rank these features and select those which best represent the quantitative distance between phenotypes may increase prediction accuracy. This article describes an empirically optimal method for phenotype prediction, revealing the critical nature of isoforms as features and recommending RSEM as a transcript quantification tool. Our most valuable predictors (MVP) filtering algorithm, increases prediction accuracy and suggests that filtering may allow researchers to focus validation on a relatively small number of transcripts. Our analysis shows that isoforms are complementary to genes, providing non-redundant information and enhanced predictive power. An analysis of phenotype prediction utilizing both gene and isoform transcripts requires the investigation of distinct transcript quantification methods, a novel investigation of the feature engineering of count-based isoforms into fractional-based isoforms, the MVP univariate filtering method and the evaluation of multiple machine learning algorithms. Comparisons between these pipelines are reported in this paper and recommendations for the optimal pipeline are provided along with a R-based implementation of the pipeline and MVP filtering method.

2 Methods

Three datasets of varying size were used to compare the pipelines generated for this analysis and to empirically develop our MVP filtering method. Included were datasets from human non-small cell lung cancer (NSCLC) patients, human patients with chronic obstructive pulmonary disease (COPD) and samples of pure corticospinal motor neuron populations isolated from amyotrophic lateral sclerosis (ALS) transgenic mice, each including samples of diseased and non-diseased phenotypes. Feature engineering from count-based isoform expression to fractional-based isoform expression is mathematically defined, and we discuss the utility of filtering/feature selection, including the MVP filtering method. We describe the datasets in detail and outline the alignment and transcript quantification processes using the Tuxedo Pipeline, RSEM/EBSeq and SeqGSEA. Finally, the selected machine learning algorithms and differential expression are reviewed.

2.1 MVP filtering algorithm

The MVP method filters gene or isoform expression tables with two phenotypes and sample replicates to prioritize features for phenotype prediction. Before calculating distribution distance, MVP drops features with null or zero expression in any sample. Each feature in the cross-validation test set is guaranteed, therefore, to have non-zero expression, which improves estimates of the sample mean and variance. Filtering the entire dataset for non-zero values does not bias the phenotype distributions as it is permutation invariant. The MVP method only ranks features that have a P-value < α determined by a t-test between phenotypes of the training set samples. For each significant feature, normal distributions P1 and P2 are modeled for the phenotypes with corresponding means μ1, μ2 and variances . The quantity r ranks features by distance between distributions such that increasing separation between means and decreasing total variance increases r score: The quantity r orders features similar to P-values generated by the t-test. However, the r quantity is designed to rank features in order of value as predictors by quantifying the distance between phenotype distributions, which may result in a better selection of features compared to P-value alone. MVP Algorithm: Input dataset with phenotypes P1 and P2 Retain only features where all samples have non-zero expression Retain only features where all samples have non-null expression Perform a t-test on all remaining features between P1 and P2 training set samples For all features with P-value < α: calculate quantity r using estimates of μ1, μ2 and from the training set phenotype distributions Select the features with one of the N highest r values as predictors, where N is the number of desired features.

2.2 Feature engineering

In addition to the massive number of isoforms, the robustness of isoform data can be increased by engineering count-based isoform expression to fractional-based isoform expression. Gene expression G, count-based isoform expression C and fractional-based isoform expression F are defined as follows: let Cij be the total read count for isoform where Ij is the set of isoforms of gene , and J is the set of all genes. Read count of gene j is (1). Fractional-based expression of each isoform i of gene j is therefore (2). Fractional-based isoform expression provides a normalization of isoform expression proportional to the corresponding gene expression. If the expression of all isoforms remained proportional in relation to the gene expression, fractional-based expression can retain the proportionality even in the case of extreme read counts in a sample. This may reduce the impact of samples with outlying read coverage which can impede the accurate estimation of phenotype distributions. Gene data may not be engineered into fractional data and therefore fractional-based isoform features may also be complementary to gene features.

2.3 Datasets

2.3.1 NSCLC

Non-small cell lung cancer RNA samples were taken from 21 patients with clinical outcomes determined by the American College of Surgery Oncology Group (Anderson ). Ten of these patients were diagnosed as disease free and 11 were diagnosed with relapse within 3 years of initial surgical resection. A total of 100–200 ng of total RNA was used to prepare libraries using the Illumina protocol for the TruSeq RNA Sample Prep Kit. These RNA-Seq libraries were paired-end sequenced on a HiScanSQ with 2 × 100 cycles and three samples per lane. The quality and adapter content of the paired-end sequences was measured with FASTQC (Patel and Jain, 2012). Trimmomatic 0.33 (Bolger ) removed the detected adapter content derived from the TruSeq2 Burnett Adapter Sequences while also trimming the ends of the sequences using the following settings: ILLUMINACLIP:TruSeq2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40.

2.3.2 COPD

A 189 sample RNA-Seq COPD dataset of 98 COPD patients and 91 patients with normal lung tissue was discovered in the NCBI GEODatasets Database using the search terms: ‘expression profiling by high throughput sequencing’ [DataSet Type]), 20:1000 [n samples], ‘lung’ (Kim ). Ten replicates of each phenotype were randomly selected and the records were obtained using the SRAToolkit v2.5.2 to create a dataset similar in size to the NSCLC dataset (Leionen et al., 2011). This study focuses on the analysis of small/medium size datasets in terms of replicates, even though the COPD dataset offers the opportunity to increase the number of replicates used. These samples had previously been processed as.bam files aligned to the hg19 human genome (UCSC) using Tophat v2.0.0 and as paired end.fastq files for transcriptomic alignment using RSEM v1.2.25.

2.3.3 ALS

UCHL1-eGFP mice were generated to visualize and purify corticospinal motor neurons (CSMN) from the motor cortex, and CSMN identity of eGFP+ neurons was previously confirmed (Yasvoina ). hSOD1G93A-UeGFP mice were generated by crossbreeding UCHL1-eGFP with hSOD1G93A mice at Northwestern University. Both healthy (n = 4) and diseased (n = 4) CSMN were isolated from motor cortex upon cortical dissociation and FACS-mediated purification approaches at postnatal day 90, using previously established protocols (Ozdinler and Macklis, 2006). The generated mRNA was converted to a cDNA library using reverse transcription. The samples were sequenced at Iowa State University on an Illumina HiSeq 2500 after cDNA library-prep using Nextera’s DNA Sample Preparation Kit. All eight samples were paired-end sequenced in one lane. The quality and adapter content of the paired-end sequences was measured with FASTQC (Patel and Jain, 2012). BBMap v35.85 was used to remove contaminated sequences (Bushnell, 2015). Trimmomatic 0.33 (Bolger ) removed the detected adapter content and the sequences were trimmed using the following settings: ILLUMINACLIP:nextera.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40.

2.4 Alignment and transcript quantification

The Tuxedo v2.2.1 pipeline begins with Tophat which runs Bowtie2 to align trimmed sequences to a reference genome (Trapnell ). The human bowtie indices were created from the hg19 human genome (UCSC) and the mouse bowtie indices were created from the mm10 mouse genome (UCSC). Tophat v2.0.0 was run with the default settings. CuffLinks assembled the transcripts using the corresponding genome GTF file as a reference. CuffMerge condenses each sample’s transcripts into a set which can be compared across all samples and the resulting GTF file was used as a reference for the CuffQuant and CuffDiff steps (Trapnell ). RSEM v1.2.25 ran Bowtie2 to align reads to the transcriptome which was constructed from the human hg19 and mouse mm10 reference genomes and the corresponding GTF file which annotates gene and isoform transcripts. RSEM was run with the default settings, but Bowtie2 was selected for alignment rather than Bowtie. RSEM assembles the transcripts and calculates their abundance using rsem-calculate-expression (Li and Dewey, 2011). The data tables created for each dataset containing the expected counts of genes or isoforms were each piped into EBSeq and normalized (Leng ). SeqGSEA requires.bam files synthesized from genomic alignment (Wang and Cairns, 2014). The files created by Tophat were used to provide input for SeqGSEA. SeqGSEA calculates expression levels on exon counts using a supplemental Python script provided with the SeqGSEA R package. The exon counts were calculated from the Tophat.bam files and the human hg19 and mouse mm10 GTF files. SeqGSEA detects gene expression by totaling the expression of all exons in each gene.

2.5 Differential expression

Due to the variation in differential expression algorithms by the Tuxedo Pipeline, EBSeq and SeqGSEA, a t-test was used to identify differentially expressed transcripts quantified and normalized by each tool. SeqGSEA was particularly problematic because it does not provide P-values for its transcripts, rather ranks the transcripts in order of predicted biological relevance. Therefore, a t-test was used to identify differentially expressed features for input to the machine learning algorithms. The t-test provided consistent differential expression compared to using each tool’s individual differential expression algorithm, which would have otherwise confounded the analysis of each transcript quantification method.

2.6 Machine learning algorithms

The following machine learning algorithms were chosen to perform the phenotype predictions. Random Forest was implemented from the ‘randomForest’ R package (Liaw and Wiener, 2002). The Elastic Net was run using the ‘glmnet’ R package (Friedman ; Zou and Hastie, 2005). Each pair of predictions was based on a cross-validated Elastic Net fit which automatically selected the optimal lambda level and feature number. The SPLS algorithm implementation came from the ‘mixOmics’ R package (Chun and Keleş, 2010; Dejean ). Each pair of predictions was based on a cross-validation test to select the optimal eta and kappa parameters.

2.7 Evaluation

In order to empirically compare multiple pipelines using various methods of transcript quantification, feature engineering, filtering and machine learning algorithms, the prediction accuracy of each pipeline may be compared via receiver operating curve (ROC) analysis. The ROC can be generated from the sensitivity and specificity of phenotype prediction (Robin ). For each pipeline comparison, cross-validation is used to create ROC curves and measure AUC. Below is an explanation of the leave-two-out cross-validation setup and summary of the pipeline comparison tests performed.

2.7.1 Cross-validation

To measure the predictive accuracy of each phenotype prediction pipeline, it is important to perform many predictions using the selected machine learning algorithms to quantify the sensitivity and specificity of the predictions. Using the pipelines for each dataset provides further information and comparison for each method. To perform leave-two-out cross-validation tests on each dataset, one sample of each phenotype is dropped iteratively while the remaining samples form the training set. Leave-two-out tests provide a robust estimation of the accuracy of each pipeline for phenotype prediction. The differential expression, filtering and machine learning steps are performed on the training set to select genes or isoforms that best represent the quantitative difference between the phenotypes. Then, the machine learning algorithm predicts the phenotypes of the dropped samples. By iterating through all possible training sets for a dataset, a robust analysis of the predictive accuracy of each pipeline is recorded in a ROC curve. This is generated from the sensitivity and specificity of the leave-two-out cross-validation test predictions. The AUC is a value between 0 and 1 that represents the accuracy of the predictions and is used in this study to compare the effectiveness of phenotype prediction pipelines.

2.7.2 Pipeline comparisons

Possible pipelines were permuted by selecting one option from each of the following steps. For each dataset, the 54 resulting pipelines were compared via leave-two-out cross-validation prediction tests. Transcript Quantification Method: Cufflinks, RSEM, SeqGSEA Feature Format: Genes, Count-based Isoforms, Fractional-based Isoforms Filtering: None, MVP Machine Learning Method: Random Forest, Elastic Net, SPLS

3 Results and discussion

To incrementally determine the optimal pipeline through analysis of the empirical data, the following questions must be answered: Which method reliably quantifies transcripts by producing consistently high AUC scores? Does the MVP filtering method enhance or decrease prediction accuracy? Which feature produces the most accurate predictions? Which machine learning algorithm most consistently performs better or equal to the others? Are isoform features redundant or complementary to gene features? The answers to these questions will be investigated in the following sections by analyzing the AUC scores generated via leave-two-out cross-validation prediction tests on each dataset.

3.1 Transcript quantification

Transcript expression quantification, the calculation of read abundance for both genes and isoforms, is a difficult task. The optimal statistical technique required for this task is still being explored (Leng ; Trapnell ; Wang and Cairns, 2014). One goal of this study is to determine whether Cufflinks, RSEM or SeqGSEA consistently produces the highest phenotype prediction accuracy as a result of a superior transcript quantification algorithm. Our study compares the AUC scores generated by pipelines including each transcript quantification tool with and without the MVP filter method (Fig. 1). Pipelines using RSEM for transcript quantification and the MVP filtering method for feature selection had significantly higher AUC scores than pipelines using the other transcript quantification methods.
Fig. 1.

Phenotype prediction by pipelines using variable transcript quantification tools and filtering. AUC values were generated by running each dataset (NSCLC, ALS, COPD) through 54 pipelines that varied in transcript quantification tool, feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive results are shown grouped by dataset, transcript quantification tool and use of filtering

Phenotype prediction by pipelines using variable transcript quantification tools and filtering. AUC values were generated by running each dataset (NSCLC, ALS, COPD) through 54 pipelines that varied in transcript quantification tool, feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive results are shown grouped by dataset, transcript quantification tool and use of filtering We have compiled detailed descriptions of pipelines and corresponding AUC scores (Tables 1–3). For many of the pipelines processed using Cufflinks for transcript quantification, the AUC scores were comparable to those using RSEM when analyzing the ALS and COPD datasets (Tables 1 and 2). RSEM, however, performed better on the NSCLC dataset compared to both Cufflinks and SeqGSEA, especially when predictions were generated from fractional-isoform based data (Table 3). It is likely that AUC scores had large variability when using SeqGSEA due to the large number of exons in any dataset making even more difficult the challenge of selecting reliable features (Fig. 1).
Table 1.

ALS dataset analysis

Transcript quantificationFeature typeFilterMachine learning Alg.AUCConfidence Int.Sens.Spec.Mean features*
CufflinksGenesNoneElastic Net0.4960.283–0.7090.6250.625516
CufflinksGenesNoneRandom Forest0.7660.558–0.9731.0000.750516
CufflinksGenesNoneSPLS0.6060.391–0.8200.6880.688516
CufflinksGenesMVPElastic Net0.9180.809–1.0001.0000.81338
CufflinksGenesMVPRandom Forest0.8830.756–1.0001.0000.75038
CufflinksGenesMVPSPLS0.8630.719–1.0000.7501.00038
CufflinksIsoform CountNoneElastic Net0.7810.601–0.9620.8750.8131146
CufflinksIsoform CountNoneRandom Forest0.8150.643–0.9861.0000.7501146
CufflinksIsoform CountNoneSPLS0.6560.449–0.8630.6980.8131146
CufflinksIsoform CountMVPElastic Net0.7930.607–0.9790.8130.81350
CufflinksIsoform CountMVPRandom Forest0.8790.752–1.0001.0000.75050
CufflinksIsoform CountMVPSPLS0.7150.516–0.9130.7500.81350
CufflinksIsoform FractionNoneElastic Net1.0001.000–1.0001.0001.00093
CufflinksIsoform FractionNoneRandom Forest0.8400.687–0.9931.0000.75093
CufflinksIsoform FractionNoneSPLS0.9380.815–1.0000.9381.00093
CufflinksIsoform FractionMVPElastic Net0.8910.776–1.0001.0000.75047
CufflinksIsoform FractionMVPRandom Forest0.8710.739–1.0001.0000.75047
CufflinksIsoform FractionMVPSPLS0.9380.815–1.0000.9381.00047
RSEMGenesNoneElastic Net0.6990.489–0.9090.7500.875631
RSEMGenesNoneRandom Forest0.7580.545–0.9711.0000.750631
RSEMGenesNoneSPLS0.7460.561–0.9320.7500.750631
RSEMGenesMVPElastic Net0.8240.653–0.9950.9380.75050
RSEMGenesMVPRandom Forest0.9470.880–1.0001.0000.75050
RSEMGenesMVPSPLS0.5000.265–0.7350.4380.93850
RSEMIsoform CountNoneElastic Net0.6990.501–0.8980.8750.688740
RSEMIsoform CountNoneRandom Forest0.7730.572–0.9751.0000.750740
RSEMIsoform CountNoneSPLS0.7460.566–0.9260.8130.750740
RSEMIsoform CountMVPElastic Net0.9100.811–1.0001.0000.75050
RSEMIsoform CountMVPRandom Forest0.9630.904–1.0001.0000.87550
RSEMIsoform CountMVPSPLS0.6720.451–0.8920.6250.93850
RSEMIsoform FractionNoneElastic Net0.6640.463–0.8660.8750.563180
RSEMIsoform FractionNoneRandom Forest0.7790.582–0.9771.0000.750180
RSEMIsoform FractionNoneSPLS0.8160.643–0.9900.8750.813180
RSEMIsoform FractionMVPElastic Net0.7810.585–0.9781.0000.75041
RSEMIsoform FractionMVPRandom Forest0.9020.787–1.0001.0000.75041
RSEMIsoform FractionMVPSPLS0.9490.871–1.0001.0000.87541
SeqGSEAGenesNoneElastic Net0.6370.435–0.8390.8130.563331
SeqGSEAGenesNoneRandom Forest0.7700.565–0.9741.0000.75331
SeqGSEAGenesNoneSPLS0.8200.666–0.9750.8130.750331
SeqGSEAGenesMVPElastic Net0.7850.608–0.9630.8130.75047
SeqGSEAGenesMVPRandom Forest0.9810.946–1.0000.9380.93847
SeqGSEAGenesMVPSPLS0.7500.553–0.9470.8130.75047
SeqGSEAExon CountNoneElastic Net0.4730.259–0.6870.4380.7501181
SeqGSEAExon CountNoneRandom Forest0.7700.565–0.9741.0000.7501181
SeqGSEAExon CountNoneSPLS0.8200.647–0.9940.6881.0001181
SeqGSEAExon CountMVPElastic Net0.9060.781–1.0000.9380.87550
SeqGSEAExon CountMVPRandom Forest0.9960.985–1.0001.0000.93850
SeqGSEAExon CountMVPSPLS0.9340.818–1.0000.8751.00050
SeqGSEAExon FractionNoneElastic Net0.8440.701–0.9860.8130.8132669
SeqGSEAExon FractionNoneRandom Forest0.7580.545–0.9711.0000.7502669
SeqGSEAExon FractionNoneSPLS0.7660.558–0.9731.0000.7502669
SeqGSEAExon FractionMVPElastic Net0.6720.466–0.8780.7500.75050
SeqGSEAExon FractionMVPRandom Forest0.7970.612–0.9821.0000.75050
SeqGSEAExon FractionMVPSPLS0.7850.590–0.9801.0000.75050

*Mean number of transcripts selected as features per leave-two-out cross-validation test.

Table 2.

COPD dataset analysis

Transcript quantificationFeature typeFilterMachine learning Alg.AUCConfidence Int.Sens.Spec.Mean features*
CufflinksGenesNoneElastic Net0.9960.990–1.0000.9800.9901297
CufflinksGenesNoneRandom Forest0.9100.867–0.9530.9300.8101297
CufflinksGenesNoneSPLS0.8170.757–0.8760.6800.8401297
CufflinksGenesMVPElastic Net0.9530.929–0.9780.8600.89050
CufflinksGenesMVPRandom Forest0.9350.904–0.9670.8900.86050
CufflinksGenesMVPSPLS0.8910.846–0.9350.7200.95050
CufflinksIsoform CountNoneElastic Net0.8750.826–0.9240.8200.8202959
CufflinksIsoform CountNoneRandom Forest0.9290.895–0.9630.9100.8102959
CufflinksIsoform CountNoneSPLS0.5640.484–0.6440.2500.9102958
CufflinksIsoform CountMVPElastic Net0.6670.590–0.7440.6000.78050
CufflinksIsoform CountMVPRandom Forest0.9230.889–0.9570.8800.80050
CufflinksIsoform CountMVPSPLS0.5320.451–0.6130.8200.33050
CufflinksIsoform FractionNoneElastic Net0.7660.702–0.8300.5200.8802283
CufflinksIsoform FractionNoneRandom Forest0.9410.912–0.9710.8900.8402283
CufflinksIsoform FractionNoneSPLS0.4480.366–0.5300.5000.6402283
CufflinksIsoform FractionMVPElastic Net0.7280.648–0.8070.7400.85050
CufflinksIsoform FractionMVPRandom Forest0.8920.845–0.9390.7700.94050
CufflinksIsoform FractionMVPSPLS0.5230.445–0.6110.5600.69050
RSEMGenesNoneElastic Net0.8600.808–0.9120.9000.7001982
RSEMGenesNoneRandom Forest0.8440.791–0.8970.7000.8901982
RSEMGenesNoneSPLS0.8700.820–0.9210.8100.8001982
RSEMGenesMVPElastic Net0.9600.934–0.9860.9000.94050
RSEMGenesMVPRandom Forest0.9150.878–0.9520.8700.84050
RSEMGenesMVPSPLS0.8740.828–0.9200.7500.84050
RSEMIsoform CountNoneElastic Net0.7880.724–0.8520.8700.6603435
RSEMIsoform CountNoneRandom Forest0.8990.855–0.9420.8200.8703435
RSEMIsoform CountNoneSPLS0.9410.912–0.9700.8200.9103435
RSEMIsoform CountMVPElastic Net0.9720.952–0.9910.9000.94050
RSEMIsoform CountMVPRandom Forest0.9970.993–1.0001.0000.96050
RSEMIsoform CountMVPSPLS0.9370.906–0.9670.8900.82050
RSEMIsoform FractionNoneElastic Net0.9770.961–0.9930.8900.970321
RSEMIsoform FractionNoneRandom Forest0.9540.926–0.9820.8600.970321
RSEMIsoform FractionNoneSPLS0.9620.938–0.9860.8900.980321
RSEMIsoform FractionMVPElastic Net0.9650.941–0.9890.9000.97050
RSEMIsoform FractionMVPRandom Forest0.9610.937–0.9850.8700.99050
RSEMIsoform FractionMVPSPLS0.9500.923–0.9760.8700.90050
SeqGSEAGenesNoneElastic Net0.9080.870–0.9470.7300.920672
SeqGSEAGenesNoneRandom Forest0.8510.797–0.9050.8600.790672
SeqGSEAGenesNoneSPLSNANA–NANANANA
SeqGSEAGenesMVPElastic Net0.7340.663–0.8050.7100.71050
SeqGSEAGenesMVPRandom Forest0.7390.671–0.8070.6000.81050
SeqGSEAGenesMVPSPLS0.6800.604–0.7550.6400.73050
SeqGSEAExon CountNoneElastic Net0.6530.578–0.7280.3600.87021862
SeqGSEAExon CountNoneRandom Forest0.8580.800–0.9150.9000.85021862
SeqGSEAExon CountNoneSPLS0.9080.860–0.9550.8300.95021862
SeqGSEAExon CountMVPElastic Net0.7370.668–0.8070.7700.66050
SeqGSEAExon CountMVPRandom Forest0.7060.633–0.7790.8400.58050
SeqGSEAExon CountMVPSPLS0.8010.738–0.8640.7900.79050
SeqGSEAExon FractionNoneElastic Net0.8310.777–0.8850.8000.81041563
SeqGSEAExon FractionNoneRandom Forest0.9320.896–0.9690.9000.89041563
SeqGSEAExon FractionNoneSPLS0.9430.908–0.9780.9900.89041563
SeqGSEAExon FractionMVPElastic Net0.9160.875–0.9570.8800.89050
SeqGSEAExon FractionMVPRandom Forest0.9480.920–0.9770.8201.00050
SeqGSEAExon FractionMVPSPLS0.9710.951–0.9920.9000.96050

*Mean number of transcripts selected as features per leave-two-out cross-validation test.

Table 3.

NSCLC dataset analysis

Transcript quantificationFeature typeFilterMachine learning Alg.AUCConfidence Int.Sens.Spec.Mean features*
CufflinksGenesNoneElastic Net0.4160.348–0.4831.0000.0001682
CufflinksGenesNoneRandom Forest0.6350.555–0.7150.8090.6181682
CufflinksGenesNoneSPLS0.6130.534–0.6930.9000.5451682
CufflinksGenesMVPElastic Net0.5430.465–0.6220.8910.34550
CufflinksGenesMVPRandom Forest0.6030.526–0.6030.8090.45550
CufflinksGenesMVPSPLS0.6640.587–0.7410.8180.62750
CufflinksIsoform CountNoneElastic Net0.5060.436–0.5770.2640.8363554
CufflinksIsoform CountNoneRandom Forest0.6710.593–0.7480.7910.6363554
CufflinksIsoform CountNoneSPLSNANA–NANANANA
CufflinksIsoform CountMVPElastic Net0.5410.465–0.6180.6230.44550
CufflinksIsoform CountMVPRandom Forest0.5810.505–0.6560.6360.53650
CufflinksIsoform CountMVPSPLS0.4470.370–0.5240.5550.50050
CufflinksIsoform FractionNoneElastic Net0.4110.358–0.4641.0000.0002623
CufflinksIsoform FractionNoneRandom Forest0.4970.418–0.5750.6450.4732623
CufflinksIsoform FractionNoneSPLSNANA–NANANANA
CufflinksIsoform FractionMVPElastic Net0.6320.558–0.7070.7180.70950
CufflinksIsoform FractionMVPRandom Forest0.6360.562–0.7100.7640.51850
CufflinksIsoform FractionMVPSPLS0.5900.515–0.6650.3550.81850
RSEMGenesNoneElastic Net0.4870.417–0.5580.8640.1821216
RSEMGenesNoneRandom Forest0.6290.550–0.7080.7640.6091216
RSEMGenesNoneSPLS0.5580.479–0.6380.8640.4091216
RSEMGenesMVPElastic Net0.5590.482–0.6360.9450.25450
RSEMGenesMVPRandom Forest0.5290.449–0.6100.8360.38250
RSEMGenesMVPSPLS0.6680.590–0.7450.8450.62750
RSEMIsoform CountNoneElastic Net0.4210.366–0.4761.0000.0001747
RSEMIsoform CountNoneRandom Forest0.6090.530–0.6880.8450.5001747
RSEMIsoform CountNoneSPLS0.5350.453–0.6160.8000.5091747
RSEMIsoform CountMVPElastic Net0.5190.441–0.5970.7910.36450
RSEMIsoform CountMVPRandom Forest0.5810.500–0.6610.6820.61850
RSEMIsoform CountMVPSPLS0.7380.671–0.8050.7820.69150
RSEMIsoform FractionNoneElastic Net0.3730.333–0.4141.0000.000880
RSEMIsoform FractionNoneRandom Forest0.5860.509–0.6640.8000.500880
RSEMIsoform FractionNoneSPLS0.6640.586–0.7430.8640.645880
RSEMIsoform FractionMVPElastic Net0.9230.886–0.9610.9550.81850
RSEMIsoform FractionMVPRandom Forest0.5890.513–0.6640.2730.90950
RSEMIsoform FractionMVPSPLS0.8170.762–0.8170.9090.63650
SeqGSEAGenesNoneElastic Net0.4150.355–0.4741.0000.009935
SeqGSEAGenesNoneRandom Forest0.6090.531–0.6860.7090.600935
SeqGSEAGenesNoneSPLS0.5510.470–0.6320.8820.427935
SeqGSEAGenesMVPElastic Net0.5320.455–0.6090.4730.69150
SeqGSEAGenesMVPRandom Forest0.5300.453–0.6070.2820.85550
SeqGSEAGenesMVPSPLS0.6370.562–0.7110.6910.60950
SeqGSEAExon CountNoneElastic Net0.3930.334–0.4520.0360.9648245
SeqGSEAExon CountNoneRandom Forest0.5820.506–0.6580.5550.6278245
SeqGSEAExon CountNoneSPLS0.5410.460–0.6210.8730.4188245
SeqGSEAExon CountMVPElastic Net0.5070.429–0.5850.6360.49150
SeqGSEAExon CountMVPRandom Forest0.6780.602–0.7430.6000.68250
SeqGSEAExon CountMVPSPLS0.5690.492–0.6470.6640.56450
SeqGSEAExon FractionNoneElastic Net0.3780.313–0.4430.9910.0367524
SeqGSEAExon FractionNoneRandom Forest0.6210.546–0.6970.6730.5917524
SeqGSEAExon FractionNoneSPLS0.5280.450–0.6060.7270.4647524
SeqGSEAExon FractionMVPElastic Net0.6360.563–0.5550.5630.69150
SeqGSEAExon FractionMVPRandom Forest0.5640.488–0.6400.5360.59150
SeqGSEAExon FractionMVPSPLS0.5070.430–0.5850.4640.65550

*Mean number of transcripts selected as features per leave-two-out cross-validation test.

ALS dataset analysis *Mean number of transcripts selected as features per leave-two-out cross-validation test. COPD dataset analysis *Mean number of transcripts selected as features per leave-two-out cross-validation test. NSCLC dataset analysis *Mean number of transcripts selected as features per leave-two-out cross-validation test. The significant increase in prediction accuracy when using RSEM for transcript quantification and the MVP filter for feature selection suggests that an optimal pipeline for phenotype prediction should include these options.

3.2 MVP filtering

This analysis confidently shows MVP filtering consistently enhances the prediction accuracy of each feature type across all datasets when using RSEM for transcript quantification. The mean AUC score for phenotype predictions in each dataset increased when using the MVP filter (Fig. 2). Reducing the feature size per iteration from greater than 1000 on average to 50, with a general increase in accuracy, is more efficient and a significant improvement in feature selection. This is further evidence for the inclusion of a filtering method such as MVP in the development of an optimal phenotype prediction pipeline.
Fig. 2.

Predictive power of pipelines that use RSEM with varying input datasets and filtering. AUC values were generated by running all input datasets through nine pipelines that all performed transcript quantification with RSEM, but varied in feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive results for these pipelines are shown grouped by dataset and use of filtering

Predictive power of pipelines that use RSEM with varying input datasets and filtering. AUC values were generated by running all input datasets through nine pipelines that all performed transcript quantification with RSEM, but varied in feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive results for these pipelines are shown grouped by dataset and use of filtering The selection of the top 50 features ranked by the MVP filtering method may provide important biomarker candidates to biomedical researchers. Following the identification of consistently selected features in cross-validation tests, these features can be tested using corrected t-tests and other differential expression methods to identify viable biomarker candidates. It may be reasonable to correlate the list of MVP features with features identified as differentially expressed by EBSeq or CuffDiff. The promising genes and isoforms may be further processed using qPCR to validate biological importance.

3.3 Features and feature engineering

Isoform data has not traditionally been included in phenotype prediction. This analysis shows that both count-based isoform data and fractional-based isoform data are competitive in regard to gene expression data and even have enhanced prediction accuracy. The AUC scores of pipelines based on each feature when using RSEM for transcript quantification were compared (Fig. 3).
Fig. 3.

Predictive power of pipelines that use RSEM with varying feature types and filtering. AUC values were generated by running all input datasets (NSCLC, ALS, COPD) through nine pipelines that all performed transcript quantification with RSEM, but varied in feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive values are shown grouped by feature type and whether filtering was applied

Predictive power of pipelines that use RSEM with varying feature types and filtering. AUC values were generated by running all input datasets (NSCLC, ALS, COPD) through nine pipelines that all performed transcript quantification with RSEM, but varied in feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive values are shown grouped by feature type and whether filtering was applied Count-based isoform expression data using the MVP filtering method generated many of the highest AUC’s in this analysis (Tables 1–3). The enhancement compared to gene expression data may result from the increased number of features that exist compared to genes alone. This result may also implicate the largely overlooked active isoforms that are involved in phenotype expression. Fractional-based isoform expression produced AUC scores comparable to those of count-based isoform data (Fig. 3). Fractional-based AUC’s seem to be superior when analyzing datasets with features that have outlier expression counts, such as the NSCLC dataset where many outliers were detected using iLOO (George ) (Table 3). Fractional-based isoform data quantified by RSEM with MVP filtering produced the only viable AUC scores in the NSCLC dataset (Table 3). The proportional normalization of each isoform in regard to corresponding gene expression may reduce the impact of extreme read counts and outliers. This may make differential expression tests more reliable in such cases and supports the case for performing phenotype prediction from fractional-based isoform data.

3.4 Machine learning algorithms

Both the Random Forest and Elastic Net machine learning algorithms produced promising results across the datasets, especially after MVP filtering (Fig. 4). SPLS was more variable than both Random Forest and Elastic Net, and produced several NA results when SPLS failed due to low variance features. Random Forest generated the highest observed AUC scores after MVP filtering. The Elastic Net consistently produced results within range or superior to Random Forest, and produced much greater scores in several datasets where Random Forest and SPLS did not generate accurate predictions (Fig. 4).
Fig. 4.

Predictive power of pipelines that use RSEM with varying machine learning algorithms and filtering. AUC values were generated by running all input datasets (NSCLC, ALS, COPD) through nine pipelines that all performed transcript quantification with RSEM, but varied in feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive values are shown grouped by machine learning algorithm and whether filtering was applied

Predictive power of pipelines that use RSEM with varying machine learning algorithms and filtering. AUC values were generated by running all input datasets (NSCLC, ALS, COPD) through nine pipelines that all performed transcript quantification with RSEM, but varied in feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive values are shown grouped by machine learning algorithm and whether filtering was applied Due to the fact that the Elastic Net performed more consistently on all datasets (Tables 1–3) and generated AUC scores comparable in accuracy to Random Forest (Fig. 4), it seems optimal to include the Elastic Net in the phenotype prediction pipeline.

3.5 Complementary features

Isoforms offer a complementary and non-redundant set of features for phenotype prediction. When selected fractional-based isoform features were converted to the respective gene name and compared to the list of gene features for the optimal RSEM-based pipeline, little overlap between the two lists of features was found. There was 4.49% overlap in the ALS dataset, 1.93% overlap in the COPD dataset and 0% overlap in the NSCLC dataset. This reinforces the importance of including isoform expression data in phenotype prediction analyses.

4 Conclusion

The results of this study support two major conclusions. First, we have identified an optimal pipeline for phenotype prediction by answering the previously discussed questions on the construction of such a tool. RSEM generally generated the highest isoform based AUC scores with MVP filtering compared to other transcript quantification tools. We have shown that feature selection via a filtering method, such as the MVP filtering algorithm, consistently increases AUC score. Overall, fractional-based isoform features can be analyzed using the Elastic Net to yield the most consistently accurate phenotype predictions. These methods have been built into an open source pipeline available via GitHub. Second, we have identified the complementary nature of isoform expression data. Isoform features provide non-redundant information and enhanced predictive power compared to gene features. Our paper addresses the necessity of including isoform expression data in phenotype prediction and biomedical data analysis.

4.1 Pipeline

The Phenotype Prediction Pipeline is implemented in R. Extensive documentation and the full source code are available at: https://github.com/clabuzze/Phenotype-Prediction-Pipeline.git
  16 in total

1.  EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.

Authors:  Ning Leng; John A Dawson; James A Thomson; Victor Ruotti; Anna I Rissman; Bart M G Smits; Jill D Haag; Michael N Gould; Ron M Stewart; Christina Kendziorski
Journal:  Bioinformatics       Date:  2013-02-21       Impact factor: 6.937

2.  IGF-I specifically enhances axon outgrowth of corticospinal motor neurons.

Authors:  P Hande Ozdinler; Jeffrey D Macklis
Journal:  Nat Neurosci       Date:  2006-10-22       Impact factor: 24.884

3.  Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors:  Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal:  J Stat Softw       Date:  2010       Impact factor: 6.440

4.  eGFP expression under UCHL1 promoter genetically labels corticospinal motor neurons and a subpopulation of degeneration-resistant spinal motor neurons in an ALS mouse model.

Authors:  Marina V Yasvoina; Baris Genç; Javier H Jara; Patrick L Sheets; Katharina A Quinlan; Ana Milosevic; Gordon M G Shepherd; C J Heckman; P Hande Özdinler
Journal:  J Neurosci       Date:  2013-05-01       Impact factor: 6.167

5.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors:  Bo Li; Colin N Dewey
Journal:  BMC Bioinformatics       Date:  2011-08-04       Impact factor: 3.307

6.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

Authors:  Cole Trapnell; Brian A Williams; Geo Pertea; Ali Mortazavi; Gordon Kwan; Marijke J van Baren; Steven L Salzberg; Barbara J Wold; Lior Pachter
Journal:  Nat Biotechnol       Date:  2010-05-02       Impact factor: 54.908

7.  Comprehensive Analysis of Transcriptome Sequencing Data in the Lung Tissues of COPD Subjects.

Authors:  Woo Jin Kim; Jae Hyun Lim; Jae Seung Lee; Sang-Do Lee; Ju Han Kim; Yeon-Mok Oh
Journal:  Int J Genomics       Date:  2015-03-05       Impact factor: 2.326

8.  An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data.

Authors:  Nysia I George; John F Bowyer; Nathaniel M Crabtree; Ching-Wei Chang
Journal:  PLoS One       Date:  2015-06-03       Impact factor: 3.240

9.  Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems.

Authors:  Ruolin Liu; Ann E Loraine; Julie A Dickerson
Journal:  BMC Bioinformatics       Date:  2014-12-16       Impact factor: 3.169

10.  Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors:  Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal:  Bioinformatics       Date:  2014-04-01       Impact factor: 6.937

View more
  3 in total

1.  Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers?

Authors:  Nathan T Johnson; Andi Dhroso; Katelyn J Hughes; Dmitry Korkin
Journal:  RNA       Date:  2018-06-25       Impact factor: 4.942

2.  LogLoss-BERAF: An ensemble-based machine learning model for constructing highly accurate diagnostic sets of methylation sites accounting for heterogeneity in prostate cancer.

Authors:  K Babalyan; R Sultanov; E Generozov; E Sharova; E Kostryukova; A Larin; A Kanygina; V Govorun; G Arapidi
Journal:  PLoS One       Date:  2018-11-02       Impact factor: 3.240

3.  DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier.

Authors:  Maxat Kulmanov; Robert Hoehndorf
Journal:  PLoS Comput Biol       Date:  2020-11-18       Impact factor: 4.475

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.