Yao Dong, Shaoze Zhou, Li Xing, Yumeng Chen, Ziyu Ren, Yongfeng Dong, Xuekui Zhang.
Abstract
Deep Learning (DL) has been broadly applied to solve big-data problems in biomedical fields, with its greatest success in image processing. Recently, many DL methods have also been applied to analyze data from genomic studies. However, genomic data usually have too small a sample size to fit a complex network, and, unlike images, they lack common structural patterns that would allow pre-trained networks or convolution layers to be exploited. The concern of overusing DL methods motivates us to compare the performance of DL methods against popular non-deep Machine Learning (ML) methods for analyzing genomic data over a wide range of sample sizes. In this paper, we conduct a benchmark study using the UK Biobank data and many random subsets of it with different sample sizes. The original UK Biobank data has about 500k participants. Each participant has comprehensive characteristics, disease histories, and genomic information, i.e., the genotypes of millions of Single-Nucleotide Polymorphisms (SNPs). We are interested in predicting the risk of three lung diseases: asthma, COPD, and lung cancer. 205,238 participants have recorded disease outcomes for these three diseases. Five prediction models are investigated in this benchmark study: three non-deep machine learning methods (Elastic Net, XGBoost, and SVM) and two deep learning methods (DNN and LSTM). Besides the most popular performance metrics, such as the F1-score, we promote the hit curve, a visual tool for describing the performance of predicting rare events. We discovered that DL methods frequently fail to outperform non-deep ML methods in analyzing genomic data, even in large datasets with over 200k samples. The experimental results suggest not overusing DL methods in genomic studies, even with biobank-level sample sizes. The performance gap between DL and non-deep ML narrows as the sample size increases.
This suggests that when the sample size is already large, further increasing it yields more performance gain for DL methods. Hence, DL methods could become preferable for genomic data sets larger than the one in this study.
Keywords: deep learning; disease prediction; genomic analysis; hit curve; imbalance data; machine learning
Year: 2022 PMID: 36212148 PMCID: PMC9537734 DOI: 10.3389/fgene.2022.992070
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1 A workflow diagram of the study process. We perform data preprocessing on the 502,524 samples from UK Biobank. After the initial assessment and quality control, data are retained for 205,238 cases, with detailed procedures in. There are 27,692 asthma cases, 6,449 COPD cases, and 1,202 lung cancer cases. Age, sex, BMI, FEV1Z, and smoking status are covariates. 2,000 SNPs are retained after filtering and screening the original 2 million SNPs. The retained dataset is divided into ten subsets, with sample sizes ranging from 10% to 100% of the data. Each subset is split by disease status into 70% training and 30% testing sets. This study uses three non-deep ML models (Elastic Net, XGBoost, and SVM) and two DL models (DNN and LSTM) to construct the prediction models. Finally, model performance is evaluated by metrics such as precision, recall, F1-score, AUC, and the hit curve.
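The 70/30 split by disease status described in the workflow is a stratified split. A minimal sketch, not the authors' code; `stratified_split` is a hypothetical helper:

```python
import random

def stratified_split(labels, test_frac=0.3, seed=42):
    """Split sample indices into train/test sets, preserving the
    disease-status ratio in each set (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# 10 controls and 10 cases: both classes contribute 30% to the test set
train, test = stratified_split([0] * 10 + [1] * 10)
```

In practice, stratifying by disease status matters here because the outcomes are rare; a plain random split could leave a small subset's test set with almost no cases.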
TABLE 1 Descriptive statistics of the dataset. This table shows the relationship between smoking status and the other covariates, i.e., age, sex, BMI, FEV1Z score, asthma status, COPD status, and lung cancer status.
| Covariates | Never smoked | Previously smoked | Currently smokes |
|---|---|---|---|
| Age | | | |
| <55 years | 47,137 (42.1%) | 22,112 (29.3%) | 8,269 (46.6%) |
| ≥55 years | 64,826 (57.9%) | 53,414 (70.7%) | 9,480 (53.4%) |
| Sex | | | |
| Male | 69,300 (61.9%) | 39,670 (52.5%) | 8,912 (50.2%) |
| Female | 42,663 (38.1%) | 35,856 (47.5%) | 8,837 (49.8%) |
| BMI_mean | 27.00 (±4.67) | 27.83 (±4.68) | 26.93 (±4.65) |
| FEV1Z_mean | 0.31 (±1.05) | 0.44 (±1.10) | 0.85 (±1.17) |
| Asthma status | 15,110 (13.5%) | 10,343 (13.7%) | 2,239 (12.6%) |
| COPD status | 1,350 (1.2%) | 3,338 (4.4%) | 1,761 (9.9%) |
| Lung cancer status | 185 (0.17%) | 627 (0.83%) | 390 (2.2%) |
FIGURE 2 Performance of the five models at the ten different sample sizes for predicting asthma, COPD, and lung cancer, respectively. Performance is shown by precision, recall, and F1-score; the shaded areas are the ±1 standard error confidence bounds.
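The metrics in Figure 2 follow directly from the confusion counts. A minimal sketch; the counts below are illustrative, not from the study:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts.
    These metrics matter here because the disease outcomes are rare,
    so plain accuracy would be dominated by the healthy majority."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative counts: 30 true positives, 10 false positives, 30 false negatives
p, r, f = precision_recall_f1(tp=30, fp=10, fn=30)  # → 0.75, 0.5, 0.6
```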
FIGURE 3 Hit curve graphs of AsthmaStatus, COPDStatus, and CancerStatus classification by the five models on the 10–100% data sets. The x-axis represents the number of test subjects selected in descending order of estimated probability. The y-axis represents the number of truly diseased subjects among them that are correctly identified in the test set. The point (m1, m2) indicates that m2 of the first m1 selected subjects are correctly predicted as diseased. The curves show the average hit curves of the five models, and the shaded areas denote the confidence bounds constructed using 10-fold cross-validation (i.e., ± one standard error). The brown bar at the bottom marks where non-deep ML models are significantly better than DL models.
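The hit curve described in the caption can be computed by ranking test subjects by predicted probability and cumulatively counting the true cases. A minimal sketch, assuming binary labels and probability scores:

```python
import numpy as np

def hit_curve(y_true, scores):
    """Hit curve: hits[m-1] is the number of truly diseased subjects
    among the top-m subjects ranked by estimated probability."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # descending scores
    return np.cumsum(np.asarray(y_true)[order])

# Toy example: 3 cases among 6 subjects
y = [0, 1, 0, 1, 1, 0]
p = [0.2, 0.9, 0.1, 0.8, 0.4, 0.7]
print(hit_curve(y, p))  # → [1 2 2 3 3 3]
```

Plotting m against hits[m-1] gives the curves in Figure 3; a steeper initial rise means the model concentrates true cases at the top of its ranking, which is exactly what matters for rare-event prediction.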