Zifeng Wang, Aria Masoomi, Zhonghui Xu, Adel Boueiz, Sool Lee, Tingting Zhao, Russell Bowler, Michael Cho, Edwin K Silverman, Craig Hersh, Jennifer Dy, Peter J Castaldi.
Abstract
Most predictive models based on gene expression data do not leverage information related to gene splicing, despite the fact that splicing is a fundamental feature of eukaryotic gene expression. Cigarette smoking is an important environmental risk factor for many diseases, and it has profound effects on gene expression. Using smoking status as a prediction target, we developed deep neural network predictive models using gene, exon, and isoform level quantifications from RNA sequencing data in 2,557 subjects in the COPDGene Study. We observed that models using exon and isoform quantifications clearly outperformed gene-level models when using data from 5 genes from a previously published prediction model. Whereas the test set performance of the previously published model was 0.82 in the original publication, our exon-based models including an exon-to-isoform mapping layer achieved a test set AUC (area under the receiver operating characteristic curve) of 0.88, which improved to an AUC of 0.94 using exon quantifications from a larger set of genes. Isoform variability is an important source of latent information in RNA-seq data that can be used to improve clinical prediction models.
Year: 2021 PMID: 34634029 PMCID: PMC8530282 DOI: 10.1371/journal.pcbi.1009433
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Visual abstract.
(a) Dataset split and usage. The number in each cell represents the number of subjects. The training set is equally split into 5 folds for deep learning model optimization (cross-validation for hyperparameter tuning and architecture search). The validation set is used to select the optimal model, and the testing set is held out for performance evaluation. (b) Model overview. Our model consists of a Feature Selection Layer (FSL), an Isoform Map Layer (IML) (used only when the input features are exons), and standard fully connected layers. FSL associates each input feature with a non-negative learnable weight, which represents the importance of that feature with respect to smoking status. IML encodes exon-to-isoform relationships via a binary matrix R, such that if exon i is contained within isoform j, we set R_ij = 1, otherwise R_ij = 0. By multiplying R element-wise with the corresponding learnable weights W, we only consider canonical exon-to-isoform relationships.
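The FSL and IML described in panel (b) can be illustrated with a small forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the caption does not say how the FSL weights are kept non-negative, so a softplus transform is assumed here, and the function names, toy matrices, and input values are all hypothetical. It is written in dependency-free Python for readability; a real model would use a deep learning framework.

```python
import math

def softplus(v):
    # Assumption: softplus is one possible way to keep FSL weights non-negative.
    return math.log1p(math.exp(v))

def feature_selection_layer(x, raw_weights):
    # Scale each input feature by its own non-negative importance weight.
    return [xi * softplus(w) for xi, w in zip(x, raw_weights)]

def isoform_map_layer(x, R, W):
    # z_j = sum_i x_i * (R_ij * W_ij): weights are masked element-wise by R,
    # so exon i contributes to isoform j only when R_ij == 1.
    n_exons, n_isoforms = len(R), len(R[0])
    return [
        sum(x[i] * R[i][j] * W[i][j] for i in range(n_exons))
        for j in range(n_isoforms)
    ]

# Toy example (hypothetical values): 3 exons, 2 isoforms.
R = [[1, 0],   # exon 0 belongs to isoform 0 only
     [1, 1],   # exon 1 belongs to both isoforms
     [0, 1]]   # exon 2 belongs to isoform 1 only
W = [[0.5, 9.0],   # 9.0 has no effect because R[0][1] == 0
     [0.2, 0.3],
     [7.0, 0.4]]   # 7.0 has no effect because R[2][0] == 0
x = [1.0, 2.0, 3.0]            # exon-level quantifications
raw_fsl = [0.0, 0.0, 0.0]      # unconstrained FSL parameters

selected = feature_selection_layer(x, raw_fsl)
isoform_activations = isoform_map_layer(selected, R, W)
print(isoform_activations)
```

Because the mask is applied to the weights rather than to the inputs, entries of W outside the annotated exon-to-isoform pairs never influence the output, whatever values they take during training.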
Characteristics of subjects.
| | Training | Validation | Testing | P-value |
|---|---|---|---|---|
| Number of subjects | 1637 | 407 | 513 | |
| Age, years | 65.4 (58.6, 71.9) | 65.6 (58.4, 71.3) | 65.4 (58.6, 71.7) | 0.2 |
| Sex, %males | 51.1% | 55.8% | 49.9% | 0.2 |
| Race, %non-Hispanic whites | 74.3% | 74.9% | 77.8% | 0.3 |
| BMI | 28.1 (24.5, 32.3) | 28.1 (25.0, 32.1) | 27.9 (25.1, 32.2) | 0.4 |
| Smoking pack-years | 40.0 (28.0, 54.8) | 40.0 (26.9, 52.7) | 40.0 (28.0, 57.9) | 0.8 |
| Current smokers, % | 35.4% | 35.4% | 35.5% | 0.9 |
| FEV1, %predicted | 81.5 (62.7, 95.2) | 84.1 (66.8, 97.3) | 82.2 (63.7, 96.2) | 0.08 |
| FEV1/FVC | 0.71 (0.61, 0.78) | 0.72 (0.62, 0.78) | 0.72 (0.62, 0.79) | 0.7 |
| COPD case status, % | 31.7% | 28.6% | 29.3% | 0.3 |
All values are from the COPDGene visit 2. BMI: Body mass index; FEV1: Forced expiratory volume in 1 second; FVC: Forced vital capacity; GOLD: Global Initiative for Chronic Obstructive Lung Disease; COPD case status defined as subjects with GOLD spirometric grade ≥ 2. Variables are expressed as medians and interquartile ranges (25th to 75th percentiles) for continuous variables, and percentages for categorical variables. P-values are obtained using the Kruskal-Wallis test for the continuous variables and chi-square test for the proportions.
Predictive performance of modified Beineke models using gene, isoform and exon-level expression data.
| | Val Accuracy | Val AUC | Test Accuracy | Test AUC |
|---|---|---|---|---|
| Gene | 0.698 | 0.758 | 0.743 | 0.780 |
| Isoform | 0.757 | 0.827 | 0.774 | 0.828 |
| Exon | 0.801 | 0.859 | 0.808 | 0.869 |
| Exon, GML-GTF | 0.771 | 0.807 | 0.789 | 0.811 |
| Exon, GML-GTF, FSL | 0.776 | 0.805 | 0.741 | 0.796 |
| Exon, IML-GTF | | 0.876 | 0.825 | 0.870 |
| Exon, IML-GTF, FSL | | | | |
Val: validation data. AUC: area under the receiver operating characteristic. IML-GTF: Isoform Map Layer containing information from Ensembl GTF file. GML-GTF: Gene Map Layer containing information from Ensembl GTF file. FSL: Feature Selection Layer. Best results are shown in bold.
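For readers less familiar with the two metrics reported in these tables, the following sketch computes accuracy and AUC from predicted scores. It uses the pairwise (Mann-Whitney) definition of AUC: the probability that a randomly chosen positive subject receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are made-up toy values, and the 0.5 decision threshold for accuracy is an assumption; the source does not state which threshold was used.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    # Assumed threshold: scores at or above it are predicted as class 1.
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy data (hypothetical): 1 = current smoker, 0 = former smoker.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

print(round(roc_auc(labels, scores), 3))
print(round(accuracy(labels, scores), 3))
```

AUC depends only on the ranking of the scores, while accuracy depends on the chosen threshold, which is one reason the two columns can disagree about which model is best.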
Fig 2. ROC curves in test data for the 4-gene modified Beineke model using gene (black), isoform (blue), and exon-level (red) quantifications.
Isoform and exon-level data outperform gene-level data (DeLong p = 0.002 and <0.001, respectively).
Fig 3. Cross-validation accuracy calculated during model optimization for exon-level data.
Predictive performance of various models using exon-level data, including elastic net for comparison.
| | Val Accuracy | Val AUC | Test Accuracy | Test AUC |
|---|---|---|---|---|
| Exon, Elastic Net | 0.821 | 0.861 | 0.774 | 0.903 |
| Exon + Iso, Elastic Net | 0.808 | 0.894 | 0.766 | 0.884 |
| Exon Base | 0.813 | 0.886 | 0.842 | 0.913 |
| Exon, GML-GTF | 0.833 | 0.899 | 0.842 | 0.913 |
| Exon, GML-GTF, FSL | 0.850 | 0.903 | 0.838 | 0.919 |
| Exon, IML-GTF | 0.843 | 0.905 | 0.854 | 0.924 |
| Exon, IML-GTF, FSL | | | | |
Val: validation data. AUC: area under the receiver operating characteristic. Exon + Iso: Concatenation of exon and isoform data. IML-GTF: Isoform Map Layer containing information from GTF file. GML-GTF: Gene Map Layer containing information from Ensembl GTF file. FSL: Feature Selection Layer. Best results are shown in bold.
Fig 4. ROC curves in test data for the deep learning base exon model (black) and the model including the isoform map layer and feature selection layer (red), which has significantly better performance (DeLong test p = 0.02).
Fig 5. ROC curves in test data for serum cotinine (black) and the exon model including the exon-to-isoform map layer and feature selection layer (red), which has significantly better performance (DeLong test p = 0.01).
Top 10 enriched GO pathways.
| GO ID | Term | Annotated | Significant | Expected | p-value |
|---|---|---|---|---|---|
| GO:0032092 | positive regulation of protein binding | 9 | 6 | 1.87 | 0.0036 |
| GO:0043547 | positive regulation of GTPase activity | 24 | 11 | 4.99 | 0.005 |
| GO:0031397 | negative regulation of protein ubiquitination | 9 | 5 | 1.87 | 0.0076 |
| GO:0006892 | post-Golgi vesicle-mediated transport | 5 | 4 | 1.04 | 0.0076 |
| GO:0006998 | nuclear envelope organization | 5 | 4 | 1.04 | 0.0076 |
| GO:0015696 | ammonium transport | 5 | 4 | 1.04 | 0.0076 |
| GO:0032722 | positive regulation of chemokine production | 5 | 4 | 1.04 | 0.0076 |
| GO:0010950 | positive regulation of endopeptidase activity | 17 | 8 | 3.53 | 0.0086 |
| GO:0032885 | regulation of polysaccharide biosynthetic process | 3 | 3 | 0.62 | 0.0089 |
| GO:0048199 | vesicle targeting, to, from or within Golgi | 3 | 3 | 0.62 | 0.0089 |
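The Annotated, Significant, and Expected columns can be related with a little arithmetic. The sketch below assumes the usual enrichment null, in which Expected = Annotated × (fraction of all tested genes that are significant); the table does not state this formula, so it is an inference from the numbers. Under that assumption, Expected / Annotated should be roughly constant across rows, and Significant / Expected gives a fold enrichment. The values are copied from the table; the variable names are illustrative.

```python
# (GO ID, annotated, significant, expected), copied from the table above.
rows = [
    ("GO:0032092", 9, 6, 1.87),
    ("GO:0043547", 24, 11, 4.99),
    ("GO:0031397", 9, 5, 1.87),
    ("GO:0006892", 5, 4, 1.04),
    ("GO:0006998", 5, 4, 1.04),
    ("GO:0015696", 5, 4, 1.04),
    ("GO:0032722", 5, 4, 1.04),
    ("GO:0010950", 17, 8, 3.53),
    ("GO:0032885", 3, 3, 0.62),
    ("GO:0048199", 3, 3, 0.62),
]

# Assumed null model: expected = annotated * background_fraction.
implied_fractions = [expected / annotated for _, annotated, _, expected in rows]
fold_enrichments = [significant / expected for _, _, significant, expected in rows]

for (go_id, *_), frac, fold in zip(rows, implied_fractions, fold_enrichments):
    print(f"{go_id}  implied background fraction={frac:.3f}  fold enrichment={fold:.2f}")
```

Every row implies a background fraction close to 0.21, which is consistent with a single shared ratio of significant to tested genes; the p-values themselves cannot be reproduced from the table alone because the total gene counts are not listed.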