Yinhe Feng, Yubin Wang, Chunfang Zeng, Hui Mao.
Abstract
Chronic airway diseases are characterized by airway inflammation, obstruction, and remodeling and show high prevalence, especially in developing countries. Among them, asthma and chronic obstructive pulmonary disease (COPD) show the highest morbidity and socioeconomic burden worldwide. Although there are extensive guidelines for the prevention, early diagnosis, and rational treatment of these lifelong diseases, their value in precision medicine is very limited. Artificial intelligence (AI) and machine learning (ML) techniques have emerged as effective methods for mining and integrating large-scale, heterogeneous medical data for clinical practice, and several AI and ML methods have recently been applied to asthma and COPD. However, very few methods have significantly contributed to clinical practice. Here, we review four aspects of AI and ML implementation in asthma and COPD to summarize existing knowledge and indicate future steps required for the safe and effective application of AI and ML tools by clinicians. © The author(s).
Keywords: artificial intelligence; asthma; chronic airway diseases; chronic obstructive pulmonary disease; machine learning
Year: 2021 PMID: 34220314 PMCID: PMC8241767 DOI: 10.7150/ijms.58191
Source DB: PubMed Journal: Int J Med Sci ISSN: 1449-1907 Impact factor: 3.738
Summary of common machine learning algorithms
| Type of machine learning algorithm | Description | References describing applications |
|---|---|---|
| Natural language processing | Natural language processing is an umbrella term for a family of techniques for processing human language. It is commonly divided into natural language understanding (NLU) and natural language generation (NLG): NLU focuses on extracting meaning from text, while NLG focuses on generating natural text from that understanding. | |
| K nearest neighbor | K nearest neighbor is an instance-based learning algorithm: the training process simply memorizes the training data. A new sample is classified according to its similarity to the stored samples, calculated with measures such as Euclidean distance or Hamming distance. | |
| Random forest | Random forest is an ensemble learning method. It builds multiple decision trees and aggregates their outputs to classify data. The size of the trees and the number of variables usually determine the performance of the model. | |
| Support vector machine | Support vector machines are usually used for classification and regression. They learn the optimal separating hyperplane to classify data. They generally have a low misclassification error and scale well to high-dimensional data; however, selecting an appropriate kernel function is essential. | |
| Artificial neural network | This is a hierarchical nonlinear mapping network built from neurons and activation functions. Its structure comprises three main parts: an input layer, hidden layers, and an output layer, which together map input variables to a predicted outcome. The primary limitation is the underlying model's lack of transparency. | |
| Latent class analysis | Latent class analysis is a statistically principled technique used in factor analysis, cluster analysis, and regression. It explains and estimates the associations among manifest indicators through latent class variables. The method is well suited to identifying subgroups in large, heterogeneous data. | |
| K-means | This method divides the dataset into K clusters, and each cluster is represented by the average value of all samples in the cluster, which is called the "centroid". K-means clustering is easy to interpret and computationally efficient. However, the number of clusters needs to be prespecified. | |
| Logistic regression | Logistic regression estimates the probability of a binary outcome. Its dependent variable follows a Bernoulli distribution, and nonlinearity is introduced through the sigmoid function. | |
| Decision tree | A decision tree creates a series of decision rules to predict categorical or continuous outcomes from input variables. It contains three main parts: a root node, leaf nodes, and branches. Decision trees are easy to understand but unstable and prone to overfitting. | |
| Lasso regression | Lasso regression is a linear regression method with L1 regularization, which shrinks the coefficients of variables and sets some of them exactly to zero, thereby performing variable selection. | |
| Naïve Bayes | Naïve Bayes is a classification algorithm based on Bayes' theorem that assumes the variables are conditionally independent of one another. It is relatively simple and performs well in the presence of noise, missing data, and irrelevant variables. |
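To make the instance-based learning described in the table concrete, the following is a minimal pure-Python sketch of a k-nearest-neighbor classifier using Euclidean distance and majority voting; the function name and the toy data are illustrative, not taken from any study above.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just memorizing the data; all the work happens at query time.
    neighbors = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two clearly separated groups of points.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X, y, (0.15, 0.1)))  # nearest neighbors are all "A"
print(knn_predict(X, y, (5.05, 5.0)))  # nearest neighbors are all "B"
```

Swapping `dist` for a Hamming distance over binary features would give the other similarity measure mentioned in the table.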
Machine learning studies on asthma
| Reference | Category | Study population | ML algorithms | Input features | Studied outcome | Results | Critical appraisal of the study |
|---|---|---|---|---|---|---|---|
| Wi CI, 2017 | Screening and diagnosis | 927 children: | NLP | Clinical (EMRs) | Pediatric asthmatic subjects or not | Sensitivity = 97%, specificity = 95%, positive predictive value = 90%, negative predictive value = 98% | Pros: use of electronic medical records |
| Wi CI, 2018 | Screening and diagnosis | 595 children: | NLP | Clinical (EHR) | Pediatric asthmatic subjects or not | Sensitivity = 92%, specificity = 96%, positive predictive value = 89%, negative predictive value = 97% | Pros: use of an external electronic medical records system |
| Kaur H, 2018 | Screening and diagnosis | 514 children: | NLP | Clinical (EHR) | Pediatric asthmatic subjects or not | Sensitivity = 86%, specificity = 98%, positive predictive value = 88%, negative predictive value = 98% | Pros: development of the first algorithm to automatically extract patients who meet the Asthma Predictive Index criteria |
| Alizadeh B, 2015 | Screening and diagnosis | 254 subjects: | ANN | Clinical | Asthmatic subjects or not | Accuracy = 100% | Pros: based on 13 clinical characteristics used by physicians to diagnose asthma |
| Amaral J, 2017 | Screening and diagnosis | 75 stable asthma patients: 39 with airway obstruction and 36 without | KNN, RF, ADAB, FDSC | Forced oscillation technique parameters | Airway obstruction | KNN reached the highest accuracy (AUC = 0.91) | Pros: use of the non-invasive forced oscillation technique |
| Amaral J, 2020 | Screening and diagnosis | 97 individuals: | KNN, RF, ADAB, SVM | Forced oscillation technique parameters | Asthmatic or restrictive respiratory diseases subjects | All classifiers achieved high accuracy (AUC≥0.9) | Pros: differential diagnosis of asthma and restrictive respiratory diseases |
| Zhan J, 2020 | Screening and diagnosis | 355 asthma patients and 1,480 healthy individuals | Mahalanobis-Taguchi system | Routine blood biomarkers | Asthmatic subjects or not | Accuracy = 94.15% in asthma patients and 97.20% in healthy individuals | Pros: diagnosis of asthma based on routine blood biomarkers |
| Sinha A, 2017 | Screening and diagnosis | 89 asthmatic subjects and 20 healthy controls | RF | Nuclear magnetic resonance spectra of exhaled breath condensate | Asthmatic subjects or not | Sensitivity = 80%, specificity = 75% | Pros: advocated the use of exhaled breath condensate spectral signatures |
| Islam MA, 2018 | Screening and diagnosis | 60 subjects: | ANN, SVM | Clinical (lung sounds) | Normal or asthmatic subjects | Accuracy = 89.2(±3.87)% in ANN and 93.3(±3.10)% in SVM | Pros: used lung respiratory sound signals |
| Singh OP, 2018 | Screening and diagnosis | non-asthmatic = 30 | SVM, KNN, NB | Respired carbon dioxide waveform | Asthmatic subjects or not | Accuracy = 94.52%, sensitivity = 97.67%, and specificity = 90% in SVM | Pros: non-invasive, patient-independent method based on simple signal processing algorithm to screen for asthma |
| Tomita K, 2019 | Screening and diagnosis | 566 adult out-patients (367 asthma patients) | SVM, DNN | Clinical, Lung function test, Bronchial challenge test | Adult asthmatic subjects or not | Accuracy = 98% in DNN and 82% in SVM | Pros: models based on symptoms, physical signs and objective tests |
| Couto M, 2015 | Classification and assessment | asthmatic athletes = 150 | LCA | Clinical (athletes' records) | Asthmatic phenotypes | Two phenotypes: atopic asthma and sports asthma | Pros: identification of asthmatic athlete phenotypes |
| Chen Q, 2012 | Classification and assessment | 689 asthma children | LCA, BIC | Clinical (questionnaire data) | Asthmatic phenotypes | Four phenotypes: never/infrequent, early-transient, early-persistent, and late-onset | Pros: identification of phenotypes based on wheeze |
| Weinmayr G, 2013 | Classification and assessment | >4,000 asthma children | LCA, BIC | Clinical (questionnaire), Bronchial hyperresponsiveness | Childhood asthma phenotypes | Seven phenotypes: one corresponding to healthy children; three related to wheeze; three related to congestion and coughed-up phlegm | Pros: identification of phenotypes according to respiratory symptoms |
| Bochenek G, 2014 | Classification and assessment | 201 aspirin-exacerbated respiratory disease patients | LCA | Clinical (questionnaire, spirometry, blood eosinophilia, urinary LTE4 concentrations) | Subphenotypes within AERD phenotype | Four subphenotypes: asthma with a moderate course; asthma with a mild course; asthma with a severe course; poorly controlled | Pros: identification of aspirin-exacerbated respiratory disease phenotypes |
| Havstad S, 2014 | Classification and assessment | 594 asthma children (2 years old) | LCA | Serum IgE data on 10 allergens | Atopic asthma phenotypes | Four phenotypes: low to no sensitization; highly sensitized; milk and egg dominated; peanut and inhalant(s)/no milk | Pros: examination of a more recently born, younger, and racially mixed cohort |
| Ross MK, 2018 | Classification and assessment | 1,019 children from the CAMP study and 669 children from the ACRN/CARE dataset | PP | Clinical | Pediatric asthma phenotypes | Four phenotypes: allergic-not-obese, obese-not-allergic, allergic-and-obese, and not-obese-not-allergic | Pros: discovery of more detailed predictive features for long-term asthma control other than the current control state |
| Wu W, 2019 | Classification and assessment | 346 adult asthma patients in the Severe Asthma Research Program | Multiple-kernel k-means | Clinical, physiological, inflammatory, demographic | Asthma control state | Four phenotypes: clusters 1 and 2: young modestly corticosteroid responsive allergic asthmatics with relatively normal lung function; cluster 3: late onset asthmatics with low lung function; cluster 4: primarily young obese females with severe airflow limitation | Pros: identification of phenotypes based on corticosteroid responses |
| Prosperi MC, 2014 | Classification and assessment | 554 asthma adults | LR, RF, DT, AB | Clinical, genetic | Current asthma, wheeze, eczema | Optimal AUC = 0.84, 0.76 and 0.64 for asthma, wheeze, and eczema, respectively | Pros: integrated genomics information |
| Krautenbacher N, 2019 | Classification and assessment | 260 individuals: | Lasso regression, elastic net, RF | Genetic, immunological, environmental | Asthma phenotypes | AUC for three classes of phenotypes = 0.81 | Pros: identification of three important genes for classifying childhood asthma phenotypes: PKN2, PTK2 and ALPP |
| Williams-De | Classification and assessment | 205 individuals | DT | Clinical, genetic, demographic | Asthma endotypes | Decision tree-based methods were useful tools for identifying asthma endotypes | Pros: integrated data to identify asthma endotypes |
| Siroux V, 2014 | Classification and assessment | 3,001 asthmatic adults | LCA | Clinical (questionnaire data), genetic | Asthma phenotypes | Four phenotypes: inactive/mild nonallergic asthma, inactive/mild allergic asthma, active allergic asthma, and active adult-onset nonallergic asthma | Pros: large sample of asthmatic adults |
| Mäkikyrö EM, | Classification and assessment | 1,995 asthma subjects | LCA | Clinical (questionnaire data), asthma-related healthcare use | Asthma phenotypes | Four subtypes for women: mild asthma, moderate asthma, unknown severity, and severe asthma. | Pros: development of a simpler way to categorize asthmatic subtypes |
| Nabi FG, 2019 | Classification and assessment | 55 asthma patients | Ensemble, SVM, KNN | Wheeze sounds | Asthma severity | The best positive predictive value for the mild, moderate, and severe samples were 95% (ensemble), 88% (ensemble) and 90% (SVM), respectively. | Pros: classified wheeze sounds of asthmatic patients according to severity |
| Moustris KP, 2012 | Management and monitoring | 3,602 children | ANN | Meteorological and ambient air pollution data | Childhood asthma admissions | Index of Agreement = 0.837 | Pros: predicted childhood asthma admissions from bioclimatic and air pollution data |
| Messinger AI, 2019 | Management and monitoring | 128 asthmatic children: | ANN | Demographic, clinical (EHR) | Respiratory score | The performance of pediatric-automated asthma severity scores was better than Pediatric Asthma Score. | Pros: pARS had the potential to help standardize acute pediatric asthma care in the PICU. |
| Xiang Y, 2020 | Management and monitoring | 31,433 adult asthma patients | ANN | Clinical (EHR) | Asthma exacerbation | AUC = 0.7003 | Pros: a time-sensitive predictive model |
| Khatri KL, 2018 | Management and monitoring | Patients visiting emergency departments in Dallas County for respiratory diseases | ANN | Clinical, meteorological and environmental pollution data | Emergency department visits | Overall accuracy = 81.0% | Pros: can serve as a useful tool for peak demand prediction in emergency departments |
| Grunwell JR, 2020 | Management and monitoring | 513 asthmatic children | LCA | Clinical, demographics | Asthma exacerbation | The class of multiple sensitizations with partially reversible airflow limitation had the highest exacerbation risk (64.3%) | Pros: prediction of exacerbation in school-age children |
| Fitzpatrick AM, 2020 | Management and monitoring | 2,593 children with mild to moderate asthma aged 5-18 years | LCA | Clinical, demographics, lung function test | Lung function and exacerbation rate | Children who had multiple sensitizations with partially reversible airflow limitation had the highest exacerbation risk (52.5%) | Pros: large sample size of diverse and representative children across the United States |
| Das LT, 2017 | Management and monitoring | 2,691 asthmatic children | LR, Lasso regression, RF, SVM | Clinical (EHR) | Emergency department visits | AUC = 0.86 reached by LR | Pros: based on electronic health records (EHRs) |
| Zhang O, 2020 | Management and monitoring | 2,010 asthma patients | LR, DT, NB, perceptron algorithms | Daily monitoring data | Asthma exacerbations | AUC = 0.85, sensitivity = 90%, and specificity = 83% reached by LR | Pros: use of a large international dataset to detect severe asthma exacerbations |
| Luo L, 2018 | Management and monitoring | 6,813 admission records | XGBoost | Search index, air pollution data, weather data, historical admissions | Asthma admission | AUC = 0.832 | Pros: use of an easily accessible, daily updated search index |
| Ram S, 2015 | Management and monitoring | Emergency department visits for asthma to the Children's Medical Center of Dallas (between October 2013 and December 2013) | ANN | Twitter data, Google search interests, environmental data | Emergency department visits | Accuracy = 70% | Pros: based on real-time environmental and internet-based data |
| Finkelstein J, 2016 | Management and monitoring | 7,001 records submitted by adult asthma patients | NB, BN, SVM | Daily self-monitoring reports | Asthma exacerbations | BN model reached sensitivity, specificity, and accuracy of 100% | Pros: use of home telemonitoring data |
| Huffaker MF, 2018 | Management and monitoring | 33 subjects | RF | Recorded physiologic data | The time period during which onset of asthma symptoms occurred | Sensitivity = 47.2%, specificity = 96.3%, accuracy = 87.4% | Pros: showed that passive physiologic monitoring can be used in the home to assess asthma control |
| Luo L, 2020 | Management and monitoring | Cost data of asthmatic patients | LR, RF, SVM, classification regression tree, backpropagation neural network | Cost data | Treatment cost | AUC and sensitivity increased by 46.89% and 101.07%, respectively | Pros: use of machine learning to predict high costs |
| Khasha R, 2019 | Management and monitoring | 96 asthma patients | LR, XGBoost, RF, DT, KNN, NB, SVM | Clinical, demographics, lung function test | Control level | Optimal accuracy = 91.66% | Pros: developed a novel ensemble learning method for asthma control level detection |
| Tsang K, 2020 | Management and monitoring | 5,875 asthma patients | LR, NB, DT, SVM | mHealth data | Stable and unstable periods | Optimal sensitivity = 86.6%, optimal specificity = 72.5%, optimal AUC = 0.871 | Pros: personalized algorithms to enhance asthma management |
| Hosseini SA, 2020 | Treatment | 80 patients with mild or moderate allergic asthma | ANN | Clinical, immunologic, hematologic, demographic | Low to high level of effect | Accuracy>99% | Pros: new machine learning model for the prediction of asthmatic drug effectiveness |
Abbreviations: AB, AdaBoost; ADAB, AdaBoost with decision trees; AERD, aspirin-exacerbated respiratory disease; ANN, artificial neural networks; AUC, area under the receiver operating characteristic curve; BN, Bayesian networks; BIC, Bayesian Information Criterion; DT, decision trees; DNN, deep neural network; EMRs, electronic medical records; EHR, electronic health records; FDSC, feature-based dissimilarity space classifier; KNN, k-nearest neighbor; LCA, latent class analysis; LR, logistic regression; NB, naïve Bayesian; NLP, natural language processing; PP, predictor pursuit; RF, random forest; SVM, support vector machine.
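Several of the phenotyping studies above group patients with centroid-based clustering such as k-means. As a reminder of the mechanics, here is a minimal pure-Python sketch of Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The toy "patients" and the fixed initial centroids are illustrative only.

```python
from math import dist
from statistics import mean

def kmeans(points, centroids, iters=20):
    """Run a fixed number of assign/update iterations of Lloyd's algorithm."""
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(mean(coord) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Toy subjects described by two features, forming two obvious groups.
pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (7.8, 8.2), (8.1, 7.9)]
cents, groups = kmeans(pts, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

As the table's algorithm summary notes, the number of clusters (here, the number of initial centroids) must be specified in advance; in practice it is often chosen with criteria such as the BIC used in several of the studies above.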
Machine learning studies on chronic obstructive pulmonary disease
| Reference | Category | Study population | ML algorithms | Input features | Studied outcome | Results | Critical appraisal of the study |
|---|---|---|---|---|---|---|---|
| Matsumura K, 2020 | Screening and diagnosis | 68 non-smokers | RF | Genetic (transcriptomic data) | Smokers or early-stage COPD | Accuracy = 65% for each group | Pros: identification of novel genes associated with COPD |
| Zheng H, 2020 | Screening and diagnosis | COPD patients = 54 | SVM | Serum metabolic biomarkers | COPD subjects or not | Accuracy = 84.62%, AUC = 0.90 | Pros: based on serum metabolomics |
| Haider NS, 2020 | Screening and diagnosis | COPD patients = 30 | SVM, KNN, LR, DT, DA | Clinical (lung sound), spirometry features | COPD subjects or not | Optimal accuracy = 100% | Pros: combination of spirometry data with lung sound features for COPD diagnosis |
| Spathis D, 2019 | Screening and diagnosis | 132 patients | NB, LR, ANN, SVM, KNN, DT, RF | Clinical, demographic | Asthma or COPD | Optimal accuracy = 97.7% | Pros: identification of COPD based on 22 different clinical features |
| Al Sallakh MA, 2018 | Screening and diagnosis | Secure Anonymised Information Linkage (SAIL) Databank | LCA | Clinical (EHR) | Asthma-COPD overlap | A protocol | Pros: based on electronic health records (EHRs) |
| Pikoula M, 2019 | Classification and assessment | 30,961 COPD patients | K-means, hierarchical clustering | Clinical (EHR) | COPD phenotypes | Five phenotypes: anxiety/depression; non-comorbid; cardiovascular/diabetes; severe COPD/frailty; obesity/atopy | Pros: identification of phenotypes based on EHRs |
| Burgel PR, 2017 | Classification and assessment | 6,060 COPD patients | CART | Clinical | COPD phenotypes | Five phenotypes: mild respiratory, moderate-to-severe respiratory, moderate-to-severe comorbid/obese, very severe respiratory, very severe comorbid | Pros: integrated respiratory characteristics and comorbidities |
| Yoon HY, 2019 | Classification and assessment | 1,195 COPD patients | K-means | Clinical (seven variables) | COPD phenotypes | Four phenotypes: putative asthma-COPD overlap, mild COPD, moderate COPD, severe COPD | Pros: demonstrated that phenotype is linked to the occurrence of acute exacerbation |
| Kim WJ, 2018 | Classification and assessment | 1,676 COPD patients from 13 Asian cities | Hierarchical cluster analysis | Clinical | COPD phenotypes | Three phenotypes: worse lung function with fewer symptoms; worse lung function with more symptoms; milder COPD with a preserved FEV1 and FEV1/FVC ratio | Pros: identification of COPD subgroups in a large Asian sample |
| Castaldi PJ, 2014 | Classification and assessment | 10,192 smokers | K-means | Clinical | COPD phenotypes | Four phenotypes: relatively resistant smokers, mild upper zone emphysema-predominant, airway disease-predominant, severe emphysema | Pros: identification of phenotypes based on airway disease and emphysema |
| Bodduluri S, 2020 | Classification and assessment | 8,980 individuals | DNN, RF | Spirometry data | Chest CT phenotypes (normal, airway predominant, emphysema predominant, and mixed emphysema/airway) | The DNN model had the highest accuracy (AUC = 0.80 and 0.91) | Pros: used spirometry data to train the model |
| Gawlitza J, 2019 | Classification and assessment | 75 COPD patients | KNN, XGBoost, ANN | Quantified computed tomography | Pulmonary function | KNN model with the lowest mean relative error (16%) | Pros: prediction of lung function values from quantitative computed tomography parameters |
| Westcott A, 2019 | Classification and assessment | 95 COPD patients | LR, SVM | Thoracic computed tomography | Lung ventilation | Accuracy = 88%, AUC = 0.82 | Pros: development of a computed tomography analysis pipeline |
| González G, 2018 | Classification and assessment | 8,983 COPDGene participants and 1,672 ECLIPSE participants | Convolutional neural network | Chest computed tomography | COPD, stage, acute respiratory disease events, mortality | C-index = 0.856, accuracy = 51.1% in COPDGene cohort | Pros: based on chest computed tomography images |
| Peng J, | Classification and assessment | 410 hospitalized AECOPD patients | DT | Clinical (medical records) | Mild and severe AECOPD | Accuracy = 80.3% | Pros: fast identification of the deterioration and death risk of AECOPD patients |
| Goto T, | Management and monitoring | 44,929 hospitalized COPD patients | Lasso regression, DNN | Clinical | 30-day readmission | C-statistic = 0.61 | Pros: huge sample size and more than 1000 predictors |
| Min X, | Management and monitoring | 111,992 patients from the Geisinger Health System | LR, RF, SVM, GBDT, MLP | Medical claims data | 30-day readmission | Optimal AUC = 0.653 | Pros: combined knowledge and data driven features |
| Cavailles A, 2020 | Management and monitoring | 143,006 patients hospitalized for AECOPD | DT | Clinical | Risk of readmission | The number of previous admissions was the most important risk factor for readmission | Pros: identification of variables associated with readmission |
| Chen W, | Management and monitoring | 4,167 subjects | RF | Clinical, spirometry | Prebronchodilator FEV1, risk of airflow limitation | C-statistic = 0.86-0.87 | Pros: development of a personalized risk model to predict the risk of airflow limitation |
| Ma X, | Management and monitoring | COPD patients = 441 | KNN, LR, DT, SVM, ANN, XGBoost | Genetic, clinical | Early-stage COPD | KNN and LR had the highest precision (82%) and accuracy (81%) | Pros: identification of the association of genes and COPD development |
| Lanclus M, 2019 | Management and monitoring | 62 COPD patients | SVM | Functional respiratory imaging | COPD exacerbations | Accuracy = 80.65%, positive predictive value = 82.35% | Pros: use of functional respiratory imaging for AECOPD prediction |
| Wang C, 2020 | Management and monitoring | AECOPD patients = 135 | RF, SVM, LR, KNN, NB | Clinical (EMRs) | COPD acute exacerbations | Optimal sensitivity = 80%, specificity = 83%, positive predictive value = 81%, negative predictive value = 85%, and AUC = 0.90 from SVM | Pros: decision support for clinicians |
| Luo L, | Management and monitoring | 780,295 hospitalizations data | LR, RF, XGBoost | Medical insurance data | High-cost COPD patients | AUC = 0.787 (LR); AUC = 0.792 (RF); AUC = 0.801 (XGBoost) | Pros: identification of high costs for COPD patients |
| Morales DR, 2018 | Management and monitoring | 54,879 COPD patients | LR, SVM | Clinical | 1-year mortality | C-statistic = 0.723 | Pros: use of external data to validate models |
| Moll M, | Management and monitoring | 2,632 participants from COPDGene cohort and 1,268 participants from ECLIPSE cohort | RF | Clinical, spirometry, imaging | Time to death from any cause | C-index ≥ 0.7 in both cohorts | Pros: prediction of all-cause mortality |
| Orchard P, 2018 | Treatment | 135 COPD patients | Sparse maximum-margin classifier, ensembles of boosted classifier, multitask neural network model | Clinical (telemonitoring data), weather | Admission and initiation of oral corticosteroid treatment | Optimal AUC = 0.74 | Pros: the model serves as a guide for corticosteroid therapy |
Abbreviations: ANN, artificial neural networks; AUC, area under the receiver operating characteristic curve; BN, Bayesian networks; CART, classification and regression tree; DA, discriminant analysis; DNN, deep neural network; DT, decision trees; EMRs, electronic medical records; EHR, electronic health records; GBDT, gradient boosting decision tree; KNN, k-nearest neighbors; LCA, latent class analysis; LR, logistic regression; MLP, multi-layer perceptron; NB, naïve Bayes; RF, random forest; SVM, support vector machine.
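Most rows in both tables report an AUC or C-statistic. As a reminder of what that number means, the AUC equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case (ties counting one half). A minimal pure-Python sketch of this pairwise computation follows; the labels and scores are toy values, not from any study above.

```python
def auc_score(labels, scores):
    """Rank-based AUC: fraction of positive/negative pairs the model orders
    correctly, with ties counted as half a win."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model that ranks positives above negatives in 8 of the 9 possible pairs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auc_score(labels, scores))  # 8/9 ≈ 0.889
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts figures like the 0.653 and 0.90 reported above on a common scale.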