
The application of artificial intelligence and data integration in COVID-19 studies: a scoping review.

Yi Guo, Yahan Zhang, Tianchen Lyu, Mattia Prosperi, Fei Wang, Hua Xu, Jiang Bian.

Abstract

OBJECTIVE: To summarize how artificial intelligence (AI) is being applied in COVID-19 research and determine whether these AI applications integrated heterogeneous data from different sources for modeling.
MATERIALS AND METHODS: We searched 2 major COVID-19 literature databases, the National Institutes of Health's LitCovid and the World Health Organization's COVID-19 database on March 9, 2021. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline, 2 reviewers independently reviewed all the articles in 2 rounds of screening.
RESULTS: In the 794 studies included in the final qualitative analysis, we identified 7 key COVID-19 research areas in which AI was applied, including disease forecasting, medical imaging-based diagnosis and prognosis, early detection and prognosis (non-imaging), drug repurposing and early drug discovery, social media data analysis, genomic, transcriptomic, and proteomic data analysis, and other COVID-19 research topics. We also found that there was a lack of heterogeneous data integration in these AI applications.
DISCUSSION: Risk factors relevant to COVID-19 outcomes exist in heterogeneous data sources, including electronic health records, surveillance systems, sociodemographic datasets, and many more. However, most AI applications in COVID-19 research adopted a single-sourced approach that could omit important risk factors and thus lead to biased algorithms. Integrating heterogeneous data for modeling will help realize the full potential of AI algorithms, improve precision, and reduce bias.
CONCLUSION: There is a lack of data integration in the AI applications in COVID-19 research and a need for a multilevel AI framework that supports the analysis of heterogeneous data from different sources.
© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For permissions, please email: journals.permissions@oup.com.

Keywords:  coronavirus; deep learning; machine learning; natural language processing; neural networks

Year: 2021; PMID: 34151987; PMCID: PMC8344463; DOI: 10.1093/jamia/ocab098

Source DB: PubMed; Journal: J Am Med Inform Assoc; ISSN: 1067-5027; Impact factor: 7.942


INTRODUCTION

In just a few months, the 2019 novel coronavirus disease (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread rapidly around the globe, and at the time of this writing, there are over 100 million confirmed COVID-19 cases and a few million confirmed deaths from COVID-19 worldwide. The COVID-19 pandemic is now the second deadliest pandemic in over 100 years, behind only the 1918 influenza pandemic (ie, Spanish Flu). While the COVID-19 pandemic is still raging and the number of cases is growing exponentially, scientific communities around the world have reacted promptly by directing efforts and resources to research on the etiology, transmission, detection, treatment, and prevention and control of COVID-19. In about a year, more than 100 000 research articles on COVID-19-related topics have been published, according to PubMed.
Recent advances in artificial intelligence (AI) have provided novel methods and tools for combating global pandemics, such as COVID-19. In classic computer science textbooks, AI is broadly defined as the study of intelligent agents, machines or devices that can imitate human cognitive functions to learn from the environment and take actions. The learning process is often implemented through mathematical or statistical models in computer programs. Machine learning, of which deep learning is a subset, is a branch of AI that trains algorithms that allow computer programs to improve automatically (ie, without explicit programming) through data. In the fields of public health and medicine, AI techniques, especially machine learning and, more recently, deep learning methods, have been widely used for disease surveillance, health risks and outcomes prediction, medical diagnostics and therapeutics, clinical decision-making, and many more.
With surveillance tools, patient reporting systems, and clinical studies emerging quickly, large amounts of novel data have rapidly accumulated during the COVID-19 pandemic. There is growing interest in leveraging these data to develop AI solutions for COVID-19 challenges. However, developing AI models in the era of precision health is not a trivial task. Precision health adopts a unified approach to understanding the full range of determinants of health for health promotion, prevention, diagnosis, and treatment. The vision of precision health can only be realized through the integration and examination of a comprehensive list of determinants of health that include genetic, biological, environmental, as well as social and behavioral factors. At the same time, these determinants of health exist in various data sources that are heterogeneous in syntax (eg, file formats), schema (eg, data models and structures), and semantics (eg, meanings or interpretations of the variables). One of the first and most important challenges in building precision health AI models is integrating relevant data that contain determinants of health from the heterogeneous sources.
In this study, we conducted a scoping review of AI applications in COVID-19 research with a focus on heterogeneous data integration. Our goal was to summarize the COVID-19 research areas in which AI is being applied, the AI models being used in these research applications, and the data sources being used to build the AI models. We were particularly interested in examining whether these AI applications integrated heterogeneous data from different sources for building the models and treated missing data in the variables of interest. Although a few published reviews have summarized the applications of AI or machine learning methods in COVID-19 research, none of them examined data integration, and many focused on a specific area of COVID-19 research (eg, medical imaging).
Note that we focused on the use of AI methods for data analysis and excluded other AI fields, such as robotics.

MATERIALS AND METHODS

Search strategy

We searched 2 major COVID-19 literature databases, the National Institutes of Health (NIH) LitCovid (part of PubMed) and the World Health Organization (WHO) COVID-19 database for articles published through March 9, 2021. LitCovid is an open-resource literature hub developed by the NIH for tracking up-to-date scientific information about COVID-19. It provides central access to all COVID-19-related articles in PubMed. The WHO COVID-19 database contains the global literature on scientific findings and knowledge on COVID-19 gathered by the WHO. Both databases are updated daily with newly published articles. The following query and keywords were used to search the databases: “artificial Intelligence” or “machine learning” or “supervised learning” or “unsupervised learning” or “deep learning” or “neural networks” or “natural language processing.”
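The keywords listed above form a single disjunctive query. The following is a minimal sketch in Python of how such a query string could be assembled. It assumes a generic syntax of quoted phrases joined by OR; the exact query syntax accepted by LitCovid or the WHO COVID-19 database is not given in the text, and `build_query` is a hypothetical helper, not part of either database's interface.

```python
# Keywords as listed in the search strategy (capitalization kept as in the text).
KEYWORDS = [
    "artificial Intelligence",
    "machine learning",
    "supervised learning",
    "unsupervised learning",
    "deep learning",
    "neural networks",
    "natural language processing",
]


def build_query(keywords):
    """Join quoted phrases with OR (assumed syntax, not a documented API)."""
    return " OR ".join(f'"{kw}"' for kw in keywords)


query = build_query(KEYWORDS)
print(query)
```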

Literature screening

Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline, we screened the articles retrieved from the databases in 2 rounds. First, we screened the titles and abstracts of the identified articles and excluded those that: (1) did not use any AI methods for data analysis, (2) were unrelated to COVID-19, (3) were reviews, editorials, opinions, letters to the editor, or case reports, or (4) were not written in English. Second, we screened the full texts of the remaining articles to further exclude articles that met our exclusion criteria. Two reviewers (YZ and TL) independently reviewed all the articles in the 2 rounds of screening. Any conflicts between the 2 reviewers were reviewed and resolved by a third reviewer (YG). We extracted and summarized COVID-19- and AI-related information from the retained articles.

RESULTS

Summary

We summarized our review procedure in a PRISMA flow diagram in Figure 1. We identified 1311 and 1218 studies in the LitCovid and WHO COVID-19 databases, respectively. After removing duplicated studies, we included 1338 studies in the first round of screening. In the first round of screening of titles and abstracts, 492 studies were excluded according to our exclusion criteria, while 846 studies were included in the full-text review. In the second round of screening, another 52 studies were excluded based on full-text review and eventually, 794 studies were included in the final qualitative analysis.
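The counts in this paragraph can be checked with simple arithmetic. The sketch below uses only the numbers reported above; the number of duplicates removed is not stated in the text and is derived here as the difference between the records identified and the records screened.

```python
# Counts reported in the Summary paragraph.
litcovid = 1311
who = 1218
after_dedup = 1338
excluded_round1 = 492
excluded_round2 = 52

# Derived quantities (the duplicate count is inferred, not reported).
duplicates_removed = litcovid + who - after_dedup
full_text_reviewed = after_dedup - excluded_round1
included = full_text_reviewed - excluded_round2

print(duplicates_removed, full_text_reviewed, included)  # 1191 846 794
```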
Figure 1. Search and review procedure.

The AI applications covered in these 794 studies can be categorized into the following areas of COVID-19 research: Disease forecasting (n = 161); Medical imaging-based diagnosis and prognosis (n = 322); Early detection and prognosis (non-imaging) (n = 152); Drug repurposing and early drug discovery (n = 53); Social media data analysis (n = 44); Genomic, transcriptomic, and proteomic data analysis (n = 24); and Other COVID-19 research topics (survey studies, literature mining, surveillance, clinical trials, miscellaneous topics) (n = 38). We listed the full citations of all 794 studies by research area in the Supplementary Table S1. In the following sections, we summarized what and how AI techniques were applied in these areas. In particular, we determined whether the studies integrated heterogeneous data to expand the list of inputs (or predictors) for building the AI models. In line with Lenzerini 2002, we defined data integration as the action of combining data that are heterogeneous in syntax, schema, and semantics and extracting predictors from these data for modeling. The total number of studies and the number of studies with data integration in each research area were summarized in Figure 2.
Figure 2. Number and percentage of studies with data integration in each research area.
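To make the definition of data integration used in this review concrete, the following is a minimal sketch in Python of combining 2 sources that differ in syntax (CSV vs JSON), schema (different key names and identifier types), and semantics (population in thousands vs persons). All data values, field names, and function names are hypothetical and chosen for illustration only; none of them come from the reviewed studies.

```python
import csv
import io
import json

# Source A (CSV): case counts keyed by a zero-padded county code. Made-up values.
CASES_CSV = """county_fips,cases
01001,120
01003,340
"""

# Source B (JSON): different key names, integer IDs, population in thousands.
DEMOGRAPHICS_JSON = """[
  {"countyId": 1001, "pop_thousands": 55.9},
  {"countyId": 1003, "pop_thousands": 223.2}
]"""


def load_cases(text):
    """Syntax: parse CSV. Schema: use 'county_fips' as the shared county key."""
    return {row["county_fips"]: {"cases": int(row["cases"])}
            for row in csv.DictReader(io.StringIO(text))}


def load_demographics(text):
    """Syntax: parse JSON. Schema: pad 'countyId'. Semantics: thousands -> persons."""
    return {f"{rec['countyId']:05d}": {"population": round(rec["pop_thousands"] * 1000)}
            for rec in json.loads(text)}


def integrate(*sources):
    """Inner-join the sources on the shared county key and merge their predictors."""
    shared = set.intersection(*(set(s) for s in sources))
    return {key: {k: v for s in sources for k, v in s[key].items()}
            for key in sorted(shared)}


table = integrate(load_cases(CASES_CSV), load_demographics(DEMOGRAPHICS_JSON))
for county, row in table.items():
    rate = 1000 * row["cases"] / row["population"]
    print(county, row, f"{rate:.2f} cases per 1000")
```

The inner join keeps only units present in every source; units missing from one source would instead require an outer join followed by some treatment of the resulting missing values.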

Disease forecasting

A total of 161 studies described the use of AI for COVID-19 forecasting (Supplementary Table S1). In these studies, 106 predicted future COVID-19 incidence or mortality using historical data only, 43 predicted future or confirmed COVID-19 cases using potential risk factors as inputs, 8 characterized country-level differences in COVID-19 outcomes worldwide (clustering studies), and 4 predicted future demands for hospital resources or medical consumables.
The majority of the 106 studies on predicting future COVID-19 incidence or mortality used COVID-19 data from the Johns Hopkins University Center for Systems Science and Engineering, or local health authorities. In these studies, the long short-term memory (LSTM), a class of recurrent neural networks (RNN), was the most commonly used deep learning model. Other popular models included other types of artificial neural networks (ANN); machine learning models, such as random forest, support vector machines (SVM), and gradient boosting machine (GBM); statistical time series models, such as the autoregressive integrated moving average (ARIMA) model; and epidemiological models, such as the Susceptible-Infectious-Recovered and Susceptible-Exposed-Infected-Removed models. None of the 106 studies integrated heterogeneous data for modeling since only historical COVID-19 data were used as inputs.
In the 43 studies on COVID-19 risk factors, 27 examined environmental exposures, while the remaining 16 examined a range of other risk factors, such as population characteristics, socioeconomic status, or other health-related factors. Most of these studies used machine learning models, among which random forest and GBM were the most popular algorithms. A small portion of these studies used ANN, among which the multilayer perceptron (MLP) was the most popular. Among these 43 studies, slightly over half (n = 24, 55.8%) integrated heterogeneous data on predictors for modeling (Table 1). Three of these studies imputed missing data. Two studies used simple mean or median imputation, while the third study used the k-nearest neighbor (k-NN) method (Table 1).
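The 2 families of imputation mentioned above can be illustrated with a short, standard-library-only Python sketch. This is not the code used by the reviewed studies: the toy data are hypothetical, and the k-NN variant shown here (root mean squared difference over the columns observed in both rows, then the mean of the k nearest donors) is one simple formulation among several.

```python
import math
import statistics


def impute_simple(rows, strategy="mean"):
    """Replace None in each column with that column's mean or median."""
    stat = statistics.mean if strategy == "mean" else statistics.median
    n_cols = len(rows[0])
    fill = [stat([r[j] for r in rows if r[j] is not None]) for j in range(n_cols)]
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def impute_knn(rows, k=2):
    """Replace None with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over columns observed in
    both rows; only rows with the missing column observed can be donors.
    """
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    out = [list(r) for r in rows]
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if v is None:
                donors = [o for m, o in enumerate(rows) if m != i and o[j] is not None]
                donors.sort(key=lambda o: dist(r, o))
                out[i][j] = statistics.mean(o[j] for o in donors[:k])
    return out


# Toy predictors per unit: [median age, population density]; None = missing.
data = [[30.0, 100.0], [32.0, 120.0], [60.0, 900.0], [31.0, None]]
print(impute_simple(data, "mean")[3])    # density filled with the column mean
print(impute_simple(data, "median")[3])  # density filled with the column median
print(impute_knn(data, k=2)[3])          # density filled from the 2 nearest rows
```

In this toy example the column mean is pulled upward by the one dense unit, whereas the median and the k-NN estimate stay close to the values of the units most similar in age. In practice, libraries such as scikit-learn provide `SimpleImputer` and `KNNImputer` for the same purpose.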
Table 1. Studies on COVID-19 forecasting that integrated heterogeneous data

Study | Region | Outcome | Data source | Model | Heterogeneous data (a) | Missing data imputation
Environmental factors
Brooks et al [20] | Worldwide | COVID-19 mortality rate | World Bank, Worldometer, Index Mundi, Wikipedia, Our World in Data, JHU, BCG Atlas, WHO, Oxford, GHS Index | k-means, linear regression | Socioeconomic, health system readiness, environmental, existing disease burden, demographics, vaccination programs, and response to the pandemic | Imputed with mean values
Cao et al [21] | China | COVID-19 incidence and growth rate | Chinese NHC, Baidu Qianxi, China Health & Family Planning Statistical Yearbook, China City Statistical Yearbook, CMA, CNIC | XGBoost | Travel-related, medical, socioeconomic, environmental, and influenza-like illness factors | No
Cazzolla-Gatti et al [22] | Italy | SARS-CoV-2 mortality and infectivity | Italian Civil Protection, ARPA, I.Stat, EpiCentro, Italian MoH, ENAC, ACI.it | RF | Environmental, health, socioeconomic factors | No
Chakraborti et al [23] | Worldwide | COVID-19 incidence and deaths | ECDC, World Bank, Google | RF, GB | Natural (climatic, environmental) and human (socioeconomic, demographic) factors | No
Gujral et al [24] | USA | COVID-19 incidence | JHU, US EPA | EDEM | Air pollution, meteorological data, county-level demographics | No
Haghshenas et al [25] | Italy | COVID-19 incidence | Unspecified | ANN (PSO, DE) | Historical data, climate and urban factors | No
Kasilingam et al [26] | Worldwide | COVID-19 incidence | WHO, World Bank, Weather Underground | LR, DT, RF, SVM | Infrastructure, environment, policies, and infection-related factors | No
Khan et al [27] | China | COVID-19 incidence | Chinese NHC, IDIS, NBS, NCEP/NCAR | K-means, SIR | Temperature, population density, and demographic information | No
Kuo et al [28] | USA | COVID-19 incidence | NYT, USDA ERA, gridMET, Google, Federal Reserve Bank of Dallas | EN, PCR, PLSR, k-NN, RT, RF, GB, 2-layer ANN | County-level demographic, environmental, and mobility data | Imputed with median values
Li et al [29] | Worldwide | COVID-19 incidence and deaths | JHU, NOAA, KG system, CIA, Wikipedia, ESPN, CIES, Hupu, BBC, UN, WEO, World Bank, WHO, Knoema, FAO, OICA | LASSO | Factors on politics, economy, culture, demographics, geography, education, medical resources, scientific development, environment, diseases, diet, and nutrition | No
Mollalo et al [30] | USA | COVID-19 incidence | USAFacts (CDC, JHU CSSE), US Census, GHDx | ANN (MLP) | Historical data, sociodemographic and environmental factors, disease mortality | No
Nikolopoulos et al [31] | USA, India, UK, Germany, Singapore | COVID-19 incidence and growth rate | WHO, JHU, Beihan University, Mayer Brown, WPR, WHR, World Bank, OECD, Google | 52 statistical, epidemiological, machine- and deep-learning models | Climate, travel restrictions and curfews, population density, disease rates (lung, heart, diabetes), GDP spent on healthcare, air pollution, import data, Google trends | No
Pourghasemi et al [32] | Iran | COVID-19 incidence and deaths | Iranian MOHME, Open Street Map, WorldClim | RF | Historical data, anthropogenic and climatic factors | No
Torrats-Espinosa [33] | USA | COVID-19 incidence and death rate | Unspecified (b) | Double-Lasso Regression | County-level demographics, density and potential for public interaction, social capital, health risk factors, capacity of the healthcare system, air pollution, employment in essential businesses, and political views | No
Zawbaa et al [34] | Italy, USA, China, Japan, Iran, Egypt, Algeria, Kenya, Cote d’Ivoire | COVID-19 incidence and death rate | JHU, ECDC | ANN (MLP) | Average age, average weather temperature, BCG vaccination, malaria treatment | No
Other factors
Cobb et al [35] | USA | COVID-19 incidence | US local health departments, US Census | RF | SIP orders, county metrics | No
Galvan et al [36] | Brazil | COVID-19 incidence and deaths | Brazil MoH, IBGE, SUS, BCB, ADHB | ANN (SOM) | Socioeconomic, health, and safety data | No
Hasan et al [37] | Bangladesh | COVID-19 incidence | WHO, IEDCR, survey | LSTM, ANFIS, ANN (MLP) | Governing authorities, compliance, probability of infection and test positivity | No
Liu et al [38] | China | COVID-19 incidence | China CDC, Baidu Search data, Media Cloud, GLEAM | Complete linkage hierarchical clustering, LASSO | Official health reports, COVID-19-related internet search activity, news media activity, daily forecasts of COVID-19 activity | No
Mehta et al [39] | USA | COVID-19 incidence | NYT, CDC, GHDx | XGBoost | County-level population statistics, county-level disease rate and mortality | No
Pandit et al [40] | Worldwide | COVID-19 mortality rate | WHO, GSAID | LogitBoost, AdaboostM1 | Age, SARS-CoV-2 clade information | No
Roy et al [41] | USA | COVID-19 incidence and deaths | WPR, Wikipedia, KFF, AHRQ, Hud Exchange, Kaggle, Worldometer, Census Bureau, CDC, NYCOpenData | SVM, SGD, NC, DTs, Gaussian NB | Social, economic, environmental, demographic, ethnic, cultural and health factors | No
Sun et al [42] | USA | COVID-19 incidence | Local DOH, CMS, LTCF, NICSHC | GB | Nursing home facility and community characteristics | Imputed using k-NN
Ye et al [43] | USA | COVID-19 risk indices | WHO, CDC, Local DOH, Census Bureau, Google Maps, Reddit | cGAN, LSTM | Disease related data, demographic, mobility and social media data | No
Region differences (clustering)
Aydin et al [44] | Worldwide | Performances against COVID-19 | Self-curated, Kaggle | k-means, hierarchic clustering | GDP, Poverty index, population, stringency index, smoking rate, CVD death rate, diabetes prevalence | Imputed with mean values
Bird et al [45] | Worldwide | COVID-19 risk | Worldometers, CIA, WHO | K% binning discretization, SVM, DT, GB, NB, LDA, QDA | Population, medical doctor density, tobacco use, obesity rate, GDP, land, migration, infant mortality, birth rate, death rate | No
Carrillo-Larco et al [46] | Worldwide | COVID-19 incidence | JHU, GBD, UW, GHO, WHO | k-means | Historical data, diseases, environmental factors, sociodemographics, health system factors | No
Lai et al [47] | USA | COVID-19 incidence | NYT, CDC, Census Bureau, USALEEP | k-means | Population census data, GIS data, business pattern censuses, and other sources | No

(a) Data that are heterogeneous in syntax, schema, and semantics.

(b) Available at https://doi.org/10.7910/DVN/JHFOSE.

ADHB: Human Development Atlas of Brazil; AHRQ: Agency for Healthcare Research and Quality; ANFIS: adaptive neuro fuzzy inference system; ANN: artificial neural network; ARIMA: autoregressive integrated moving average; ARPA: Regional Environmental Protection Agency; BBC: British Broadcasting Corporation; BCB: Central Bank of Brazil; BCG: Bacillus Calmette–Guérin; BGFS-PNN: Broyden-Fletcher-Goldfarb-Shanno Optimized Polynomial Neural Network; CDC: Centers for Disease Control and Prevention; cGAN: conditional generative adversarial net; CIA: Central Intelligence Agency; CIES: Centre International d'Etude du Sport (International Centre for Sports Studies); CMA: China Meteorological Administration; CMS: Centers of Medicare and Medicaid Services; CNIC: Chinese National Influenza Center; CPC-NN: Multivariate clustering based partial curve nearest neighbor; CRC: Coronavirus Resource Center; CSSE: Center for Systems Science and Engineering; CVD: cardiovascular disease; DCP: Department of Civil Protection; DE: differential evolution algorithm; DNN: deep neural network; DOH: Departments of Health; QDA: quadratic discriminant analysis; DT: decision tree; ECDC: European Centre for Disease Prevention and Control; EDEM: Ensemble-based Dynamic Emission Model; EN: Elastic net; ENAC: Ente Nazionale per l'Aviazione Civile (Italian Civil Aviation Authority); EPA: Environmental Protection Agency; ESPN: Entertainment and Sports Programming Network; FAO: Food and Agriculture Organisation of the United Nations; GB: gradient boosting; GBD: global burden of disease; GDP: gross domestic product; GHDx: Global Health Data Exchange; GHO: Global Health Observatory; GHS: Global Health Security; GIS: geographical information systems; GLEAM: global epidemic and mobility model; GSAID: global initiative on sharing all influenza data; IBGE: Brazilian Institute of Geography and Statistics; IDIS: Infectious Disease Information System of China; IEDCR: Institute of Epidemiology, Disease Control and Research; JHU: Johns Hopkins University; KFF: Kaiser Family Foundation; KG: Köppen–Geiger climate classification; k-NN: k-nearest neighbors; LDA: linear discriminant analysis; LR: logistic regression; LSTM: long short-term memory; LTCF: long-term care focus; MLP: multilayer perceptron; MoH: Ministry of Health; MOHME: Ministry of Health and Medical Education; NB: Naïve Bayes; NBS: National Bureau of Statistics of China; NC: nearest centroid; NCAR: National Center for Atmospheric Research; NCEP: National Centers for Environmental Prediction; NHC: National Health Commissions; NICSHC: National Investment Center for Seniors Housing and Care; NLP: natural language processing; NOAA: National Oceanic and Atmospheric Administration; NYT: New York Times; OECD: Organisation for Economic Co-operation and Development; OICA: Organisation Internationale des Constructeurs d'Automobiles (International Organization of Motor Vehicle Manufacturers); PC-NN: partial curve nearest neighbor; PCR: principal components regression; PLSR: partial least squares regression; PSO: particle swarm optimization algorithm; RF: random forest; RT: regression tree; SEIR: susceptible-exposed-infected-recovered model; SGD: stochastic gradient descent; SIP: shelter-in-place; SIR: susceptible-infected-recovered model; SOM: self-organizing maps; SUS: Sistema Único de Saúde (Brazil's publicly funded healthcare system); SVM: support vector machine; UN: United Nations; USALEEP: Small-Area Life Expectancy Estimates Project; USDA ERA: United States Department of Agriculture, Economic Research Service; UW: Washington University; WEO: World Economic Outlook database; WHO: World Health Organization; WHR: World Health Rankings; WPR: world population review.

All 8 clustering studies used unsupervised machine learning models, with k-means being the most popular. These studies aimed to group and compare countries or regions based on COVID-19 incidence, risks, and preparedness or performance. Half of the studies (n = 4, 50.0%) integrated heterogeneous data for modeling (Table 1). One of the 4 studies imputed missing data with mean values (Table 1).
The 4 studies on future demands predicted the need for intensive care unit (ICU) beds or medical consumables (eg, face masks) using data on COVID-19 cases or on consumable sales or production. All 4 studies used ANN (eg, MLP) or RNN (eg, LSTM), with some studies also building machine learning models. None of these 4 studies integrated heterogeneous data for modeling.
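As an illustration of the clustering approach described above, the following is a minimal sketch of k-means (Lloyd's algorithm) in standard-library Python. The country labels and feature values are hypothetical, the features are assumed to be already scaled to a comparable range, and the deterministic initialization (the first k points) is a simplification chosen so that the example is reproducible; the reviewed studies may have used different features, initializations, and software.

```python
import math


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical, already-scaled country features: [incidence, preparedness score].
countries = {"A": [0.10, 0.90], "B": [0.15, 0.85], "C": [0.80, 0.20], "D": [0.75, 0.30]}
labels, centroids = kmeans(list(countries.values()), k=2)
print(dict(zip(countries, labels)))
```

With these toy values the algorithm separates the 2 low-incidence, high-preparedness countries from the 2 high-incidence, low-preparedness ones. When predictors come from heterogeneous sources with different units, scaling them before clustering matters because k-means relies on Euclidean distance.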

Medical imaging-based diagnosis and prognosis

A total of 322 studies described the use of AI for analyzing medical imaging data for COVID-19 diagnosis and prognosis (Supplementary Table S1). All studies analyzed either computed tomography or chest X-ray data, except for 5 studies that analyzed images of lung ultrasound or skin lesions. The most common sources of medical images were local hospitals or healthcare systems and image datasets published in the public domain, such as on GitHub or Kaggle. Roughly half of these imaging studies used convolutional neural network (CNN)-based models. More than 90% of these studies predicted COVID-19 outcomes using medical imaging data alone. Only 29 out of the 322 studies (9.0%) considered data from heterogeneous sources for AI modeling (Table 2). In addition to imaging data, these studies considered influences from demographics (eg, age, sex), clinical characteristics (eg, symptoms, lab results, disease history), and other human factors (eg, exposure history) on COVID-19 outcomes. Five of these studies imputed missing data using simple mean or median imputation (Table 2).
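The shares of studies with data integration reported so far can be reproduced from the counts given in the text. The short Python sketch below uses only those counts; the group labels are shorthand introduced here for readability.

```python
# (studies with data integration, studies in the group), as reported in the text.
reported = {
    "forecasting with risk factors": (24, 43),
    "forecasting, clustering": (4, 8),
    "medical imaging": (29, 322),
}

shares = {name: round(100 * k / n, 1) for name, (k, n) in reported.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```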
Table 2. Studies on medical imaging-based COVID-19 detection or prognosis using heterogeneous data

Study | Region | Outcome | Data source | Model | Heterogeneous data (a) | Missing data imputation
Cai et al [53] | China | RT-PCR negativity | Single hospital | Unspecified DL, LR | CT image data, clinical data | Replaced by median
Cai et al [54] | China | Need and duration of ICU, duration of oxygen inhalation, duration of hospitalization, duration of sputum NAT-positive, clinical prognosis | Single hospital | 3DQI platform, U-Net, RF | CT image data, clinical data | No
Chao et al [55] | USA, Iran, Italy | ICU admission | 3 hospitals | DNN, RF | CT image data, demographics, vitals, lab data | Imputed by mean values
Chassagnon et al [56] | France | COVID-19 staging and prognosis (mechanical ventilation) | 8 hospitals | CNN, DT, Linear SVM, XGBoosting, AdaBoost, Lasso | CT image data, clinical and biological markers | No
Cheng et al [57] | China | Severe vs nonsevere COVID-19 | Single hospital | CNN (uAI Discover-2019nCoV) | CT image data, clinical data | No
D'Ambrosia et al [58] | USA | RT-PCR confirmed SARS-CoV-2 infection | Single hospital | BN, SC, DML, LR | Symptoms, local SARS-CoV-2 prevalence, CXR imaging, molecular diagnostic performance | No
Ebrahimian et al [59] | USA, South Korea | Death vs recovery, need for mechanical ventilation | Tertiary care hospitals | CNN (U-Net), LR | CXR image data, demographics, lab data | No
Fu et al [60] | China | Stable vs progressive COVID-19 | Unspecified hospitals | SVM | CT image data, clinical and lab data | No
Grodecki et al [61] | USA, Italy | Clinical deterioration vs death | 3 hospitals | CNN (U-Net), LR | CT image data, clinical data | No
Guo et al [62] | China | COVID-19 vs seasonal flu | 2 hospitals | RF | CT image data, symptoms, blood tests, RT-PCR results | No
Hahm et al [63] | South Korea | Worsening oxygenation event | Single hospital | DL software (MEDIP) | CT severity score, demographics, comorbidity, lab data | No
Hermans et al [64] | The Netherlands | COVID-19 positivity by RT-PCR | 2 hospitals | LR | CT image data, demographics, symptoms, vitals, lab | No
Ho et al [65] | South Korea | Severe vs nonsevere COVID-19 | 5 hospitals | ANN, CNN, ACNN | CT image data, demographic, clinical, and lab data | No
Jeong et al [66] | South Korea | Severe vs nonsevere COVID-19 | Single hospital | AI software (syngo.via Frontier) | CT severity score, demographics, symptoms, comorbidity, lab | No
Kimura-Sandoval et al [67] | Mexico | Need mechanical ventilation, death | Single hospital | AI software (Siemens healthcare) | CT variables, demographics, clinical, lab | No
Lang et al [68] | USA | Acute neuroimaging findings | Single hospital | Unspecified ML, LR | CT severity score, demographics, clinical data | No
Lassau et al [69] | France | Severe vs nonsevere COVID-19 | 2 hospitals | CNN (EfficientNet-B0, ResNet50, U-Net), LR | CT variables, AI-severity score (5 clinical, biological variables) | Imputed with the average
Li et al [70] | China | Severe vs nonsevere COVID-19 | Single hospital | CNN (U-net), RF, GB, XGBoost, LR, SVM | CT outcomes, clinical biochemical indexes | Imputed with mean values
Liu et al [71] | China | COVID-19 vs non-COVID-19 pneumonia | Single hospital | CT image software (pyradiomics), LR, LASSO | CT outcomes, clinical data | No
Mei et al [72] | USA | COVID-19 positivity by RT-PCR | 18 hospitals | CNN, SVM, RF, MLP | CT findings, clinical symptoms, exposure history, lab | No
Meng et al [73] | China | Death within 14 days | 4 hospitals | CNN, LR | CT image features, clinical information | No
Mushtaq et al [74] | Italy | Death, ICU admission | Single hospital | CNN (AI system qXR), Cox PH | CXR severity, demographics, clinical data | No
Ning et al [75] | China | Morbidity, mortality | 2 hospitals | CNN, DNN, Ridge LR | CT features, 130 types of clinical features | No
Quiroz et al [76] | China | Severe vs nonsevere COVID-19 | 2 hospitals | CNN (U-Net), LR, XGBoost | CT features, demographics, clinical data | Imputed with mean values
Salvatore et al [77] | Italy | COVID-19 severity (discharge, hospitalization, ICU, or death) | Single hospital | AI tool (Thoracic VCAR), LR | CT parameters, clinical and lab data | No
Varble et al [78] | China, Japan | Asymptomatic vs pre-symptomatic patients with SARS-CoV-2 | 2 hospitals | CNN (AH-Net), LASSO LR | CT characteristics, clinical and lab data | No
Xia et al [79] | China | COVID-19 vs influenza A/B | 2 hospitals | DNN | CXR and CT features, 56 clinical features | No
Xu et al [80] | China | Healthy or COVID-19 pneumonia or non-COVID pneumonia | Single hospital | CNN, SVM, KNN, RF | CT features, 23 clinical features, 10 lab testing features | No
Xue et al [51] | China | 4-level COVID-19 severity | Multiple hospitals | DSA-MIL, MA-CLR | LUC features, age, medical history, symptoms | No

(a) Data that are heterogeneous in syntax, schema, and semantics.

3DQI: 3D quantitative imaging; ACNN: artificial convolutional neural network; AI: artificial intelligence; ANN: artificial neural networks; BN: Bayesian inference network; CNN: convolutional neural network; DL: deep learning; DML: distance metric-learning; DNN: deep neural network; DSA-MIL: dual-level supervised attention-based multiple instance learning; DT: decision tree; GB: gradient boosting; ICU: intensive care unit; LR: logistic regression; LUC: lung ultrasound; MA-CLR: modality alignment contrastive learning of representation; ML: machine learning; MLP: multilayer perceptron; NAT: nucleic acid testing; RF: random forest; SC: Information-theoretic Set Cover; SVM: support vector machine.


Early detection and prognosis (nonimaging)

A total of 152 studies described the use of AI for COVID-19 early detection (n = 52) and prognosis (n = 100) (Supplementary Table S1). The vast majority of the studies on COVID-19 early detection analyzed COVID-19 positivity (+ vs −, determined by the reverse transcription polymerase chain reaction test) as the study outcome using patient data from hospitals or healthcare systems. A wide range of AI models were used for prediction, although machine learning models (eg, random forest, GBM) were used more often than deep learning models. Furthermore, most studies used a single type of data for COVID-19 detection, such as lab test data (eg, blood cell counts or inflammatory biomarkers) or clinical symptoms. Only 8 out of the 47 studies (17.0%) integrated heterogenous data for modeling (Table 3). In addition to lab and symptom data, these studies considered data on comorbidity, medications, travel/contact history, etc.
Table 3.

Studies on COVID-19 detection or prognosis using heterogeneous data

| Study | Region | Outcome | Data source | Model | Heterogeneous data (a) | Missing data imputation |
| --- | --- | --- | --- | --- | --- | --- |
| Early detection | | | | | | |
| Ahamad et al [81] | China | Confirmed vs. suspected COVID-19 cases | Multiple hospitals | DT, RF, XGBoost, GB, SVM | Structured EHR data (demographics, symptoms), structured EHR data (isolation treatment status, travel history) | Imputed gender with random values based on male/female ratio; imputed age with random values within IQR |
| Langer et al [82] | Italy | COVID-19 positivity by RT-PCR | Single hospital | ANN | Demographics, comorbidity, medications, signs and symptoms, lab, vitals, CXR | No |
| Martin et al [83] | Worldwide | COVID-19 positivity | Literature (British Medical Journal) | AI system (Symptoma) | Keywords and symptoms, age and sex, symptom occurrence frequency rates, country-specific disease incidences | No |
| Obinata et al [84] | Japan | COVID-19 positivity by RT-PCR | 2 hospitals | RF | Demographics, vitals, lab, symptoms, contact history | No |
| Otoom et al [85] | Worldwide | COVID-19 positivity | CORD-19 repository | SVM, ANN, NB, k-NN, decision table, decision stump, OneR, ZeroR | Symptoms, travel history to suspicious areas, contact history | No |
| Shimon et al [86] | Israel | COVID-19 positivity | Multiple hospitals | CNN, SVM, RF | Voice samples (acoustic features), self-reported symptoms | No |
| Wintjens et al [87] | The Netherlands | COVID-19 positivity by RT-PCR | Single hospital | ANN, RF, LR | Breath features (CO, NO2, VOC), clinical and demographic variables | No |
| Zoabi et al [88] | Israel | COVID-19 positivity by RT-PCR | The Israeli Ministry of Health | GB | Demographics, clinical symptoms, known contact with an infected individual | No |
| Prognosis | | | | | | |
| Al-Najjar et al [89] | South Korea | Mortality | KCDC | ANN | Demographics, infection reason and date | No |
| An et al [90] | South Korea | Mortality | KNHIS | LASSO, SVM, RF, k-NN | Sociodemographic and medical information | No |
| Burian et al [91] | Germany | ICU admission | 1 hospital | RF | Demographic, clinical, lab, and imaging data | Imputed with mean or mode |
| Cheng et al [92] | USA | ICU transfer in 24 hours | 1 hospital | RF | Demographics, time-series of the admission–discharge–transfer events, clinical assessments, vital signs, lab and ECG results | Imputed with median value |
| Das et al [93] | South Korea | Mortality | KCDC | LR, SVM, k-NN, RF, GB | Demographic and exposure features | No |
| Ge et al [94] | China | Ventilator parameters | 1 hospital | Unspecified | Demographics, clinical data, ventilator parameters | No |
| Haimovich et al [95] | USA | Early respiratory decompensation | 8 EDs | RF, LASSO, GB, XGBoost | Demographics, medical histories, vitals, outpatient medications, chest radiograph reports, lab | No |
| Hu et al [96] | China | Mortality | 1 hospital | LR, PLS regression, EN, RF, bagged FDA | Demographics, CT features, lab | Imputed using bagging trees |
| Iwendi et al [97] | Worldwide | Severity, recovery, death | Kaggle (WHO, JHU) | RF | Demographics, symptoms, travel data | No |
| Josephus et al [98] | Worldwide | Mortality | Kaggle (WHO, JHU) | LR | Demographics, symptoms, travel data | Imputed (unspecified) |
| Li et al [99] | Worldwide | Mortality | GitHub and Wolfram dataset | LR, RF, SVM | Demographics, location, symptoms, travel history, market exposure, chronic disease | No |
| Liang et al [100] | China | ICU admission, requiring mechanical ventilation, death, etc | Chinese NHC | CPH, ANN | Demographic, clinical, lab, and imaging data | Imputed with multivariate imputation by chained equation |
| Ma et al [101] | China | Mortality | 1 hospital | RF, XGBoost | Symptoms, comorbidity, demographic, vitals, CT scans results, lab | No |
| Metsker et al [102] | Russia | Mortality | Russian government, single hospital | ANN | Demographics, comorbidity, lab, treatment, travel history | No |
| Mountantonakis et al [103] | USA | AF and mortality | 13 hospitals | NLP | Demographics, medical history, lab, NLP extracted atrial fibrillation | No |
| Nakamichi et al [104] | USA | Hospitalization and mortality | Multiple hospitals | AdaBoost, ET, GB, RF | Demographics, comorbidity, SARS-CoV-2 sequence clades | Multiple imputation by chained equations |
| Neuraz et al [105] | France | In-hospital mortality | 39 hospitals | NLP, Cox | Demographics, comorbidity, NLP extracted use of calcium channel blockers | No |
| Patel et al [106] | USA | Severity | 3 hospitals | RF, ANN (MLP), SVM, GB, ET classifier, AdaBoost | Demographics, international travel, contact history, comorbidity, symptoms, blood panel profile | No |
| Planchuelo-Gómez et al [107] | Spain | Headache | 1 hospital | GLM, PCA | Intensity and self-reported disability caused by headache, quality and topography of headache, migraine features, COVID-19 symptoms, lab | No |
| Schwartz et al [108] | Canada | Mortality | iPHIS, CORES, The COD, CCMtool, CCM | NLP, LR | Demographics, comorbidities, symptoms, NLP extracted long-term care home exposure | Imputed by weekly median value |
| Wu et al [109] | China, Italy, Belgium | ICU admission, death, etc | Multiple hospitals | RF, LR | Demographic, clinical, lab, and imaging data | No |

(a) Data that are heterogeneous in syntax, schema, and semantics.

AF: atrial fibrillation; ANN: artificial neural networks; CCM: Public Health Case and Contact Management Solution; CCMtool: Middlesex-London COVID-19 Case and Contact Management tool; CO: carbon monoxide; COD: the Ottawa Public Health COVID-19 Ottawa Database; CORD-19: COVID-19 Open Research Dataset; CORES: Toronto Public Health Coronavirus Rapid Entry System; CPH: Cox proportional hazard; CT: computed tomography; CXR: chest x-ray; DT: decision tree; ECG: electrocardiogram; ED: emergency department; EHR: electronic health record; EN: elastic net; ET: extra trees; FDA: flexible discriminant analysis; GB: gradient boosting; GLM: generalized linear model; ICU: intensive care unit; iPHIS: integrated Public Health Information System; IQR: interquartile range; JHU: Johns Hopkins University; KCDC: Korea Centers for Disease Control and Prevention; KNHIS: Korean National Health Insurance Service; k-NN: k-nearest neighbors; LR: linear regression; MLP: multilayer perceptron; NB: Naïve Bayes; NHC: National Health Commission; NLP: natural language processing; NO2: nitrogen dioxide; PCA: principal component analysis; PLS: partial least squares; RBF: radial basis function; RF: random forest; SHAP: Shapley additive explanation; SVM: support vector machine; VOC: volatile organic compound; WHO: World Health Organization.

The vast majority of the studies on COVID-19 prognosis examined hospitalization, ICU admission, mechanical ventilation requirements, and/or death in COVID-19 patients using data from hospitals or healthcare systems. Traditional machine learning models were preferred over deep learning models, with the most popular model being random forest. Only 21 out of the 92 studies (22.8%) integrated heterogenous data for modeling (Table 3).
These heterogenous data included demographics, clinical data (eg, lab, disease and medication history, and symptoms), genetic sequencing data, exposure history, etc. In the early detection and prognosis studies that integrated heterogenous data (Table 3), 8 studies imputed missing data. Most studies performed simple imputation based on mean, mode, or median values, while 2 studies performed multivariate imputation by chained equations, and 1 study imputed missing values using bagging trees.
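
The simple imputation strategies mentioned above (mean, median, or mode) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `impute` helper, the column names, and the toy values are hypothetical and do not come from any of the reviewed studies. Multivariate imputation by chained equations and bagged-tree imputation are more involved and are not shown.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns; None marks a missing entry.
lab_value = [12.0, None, 48.0, 30.0]
age_years = [34, 71, None, 58, 90]
sex = ["F", "M", None, "F"]

lab_imputed = impute(lab_value, "mean")     # continuous lab value: fill with the mean
age_imputed = impute(age_years, "median")   # skewed numeric value: fill with the median
sex_imputed = impute(sex, "mode")           # categorical value: fill with the most frequent level
```

Each strategy fills every gap in a column with a single constant, which keeps the code trivial but shrinks the variance of the imputed variable; that trade-off is one reason some studies turn to multivariate approaches instead.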

Drug repurposing and early drug discovery

A total of 53 studies described the use of AI for drug repurposing (36 studies) or early COVID-19 drug discovery (18 studies) (Supplementary Table S1). The majority of the studies focused on screening for candidate drugs in biomolecule or drug databases. Popular data sources included DrugBank (Food and Drug Administration [FDA]-approved and experimental drugs), ChEMBL (bioactivity database for drug discovery), PubChem (substance and compound databases), ZINC (commercially available compounds for virtual screening), and BindingDB (experimentally determined protein-ligand binding affinities). Deep learning models (eg, CNN, RNN) were used more often than the machine learning models. Furthermore, 5 out of the 36 drug repurposing studies mined the literature for repurposable drugs. All 5 studies used NLP-based methods to mine scientific literature or other relevant data. For example, 1 study examined the description of over 1.2 million bioassays in the ChEMBL database to identify COVID-19-related bioassays. The 18 studies on early drug discovery mainly focused on screening for potential biomolecules (eg, virtual ligand screening) in ligand or compound databases (eg, ChEMBL, PubChem, ZINC, BindingDB) that could target SARS-CoV-2 functional domains. Similarly, deep learning models were preferred over the machine learning models. None of the drug repurposing or early drug discovery studies integrated heterogeneous data for modeling.

Social media data analysis

A total of 44 studies described the use of AI for analyzing social media data (Supplementary Table S1). In these studies, Twitter was the single most popular data source, with 32 studies analyzing tweets from all over the world. The other 12 studies used data from Facebook, Reddit, YouTube, Weibo, etc. Most social media studies adopted a similar analytic approach: NLP methods and tools for text extraction and processing, followed by topic modeling and/or a sentiment analysis. The most common method for topic modeling was the latent Dirichlet allocation, whereas a range of machine learning models were used for sentiment analysis including SVM, Naïve Bayes, k-NN, random forest, etc. None of the social media studies integrated heterogeneous data for modeling.
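
To make the text-processing-then-classification pattern concrete, the following Python sketch tokenizes short posts and trains a small multinomial Naïve Bayes sentiment classifier, one of the model families named above. It is a minimal illustration, not a reconstruction of any reviewed study: the function names, the add-one smoothing choice, and the four hand-labeled posts are all assumptions invented for this example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(labeled_posts):
    """Count words per class for a multinomial Naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_posts:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_posts = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_posts)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-labeled posts, invented for illustration only.
posts = [
    ("so grateful for the nurses and doctors", "positive"),
    ("vaccine news gives me hope", "positive"),
    ("scared and anxious about the lockdown", "negative"),
    ("lockdown again this is awful", "negative"),
]
model = train_naive_bayes(posts)
label_a = predict(model, "feeling hope and grateful")
label_b = predict(model, "anxious about lockdown")
```

A real analysis would need far more labeled data and more careful preprocessing (eg, handling of hashtags, emojis, and multiple languages); the sketch only shows how the two stages connect.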

Genomic, transcriptomic, and proteomic data analysis

A total of 24 studies described the use of AI for analyzing SARS-CoV-2 sequence data (eg, ribonucleic acid [RNA], small interfering RNA [siRNA], or protein sequences) (Supplementary Table S1). One common analysis goal of many of these studies was to determine the unique SARS-CoV-2 RNA or protein features that could potentially be targeted for disease detection and drug or vaccine design. Over half of these studies analyzed the SARS-CoV-2 genome sequences in the National Center for Biotechnology Information GenBank. Other data sources included the Protein Data Bank, National Genomics Data Center of China, or self-generated sequence data. A wide variety of AI models were used in these studies, including the deep learning models (CNN, RNN) and the traditional machine learning models (k-NN, SVM, random forest, GBM). None of the studies integrated heterogeneous data for modeling.

Other COVID-19 research studies

Survey studies

A total of 14 survey studies used AI models for studying COVID-19-related topics in various populations around the world (Supplementary Table S1). The most common study outcomes were self-reported fear, stress, anxiety, and depression related to the pandemic. The majority of the studies used machine learning models, including random forest, XGBoost, SVM, and Naïve Bayes. Two of the studies, which were based on the same online survey, collected text data using open-ended questions. These studies performed a sentiment analysis that involved sentiment score calculation and clustering using the k-means algorithm. None of the survey studies integrated heterogeneous data for modeling.
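
As a rough sketch of the clustering step described above, the following Python code runs k-means (Lloyd's algorithm) on one-dimensional sentiment scores. The scores, the number of clusters, and the initial centroids are hypothetical values chosen for illustration; the reviewed studies' actual scoring method and settings are not reproduced here.

```python
def kmeans_1d(scores, centroids, max_iter=100):
    """Lloyd's algorithm on scalar values, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each score goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(s - centroids[k]))
            for s in scores
        ]
        # Update step: each centroid moves to the mean of its assigned scores.
        updated = []
        for k, old in enumerate(centroids):
            members = [s for s, lab in zip(scores, labels) if lab == k]
            updated.append(sum(members) / len(members) if members else old)
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return labels, centroids

# Hypothetical sentiment scores in [-1, 1], one per open-ended survey response.
scores = [-0.9, -0.7, -0.8, 0.1, 0.0, 0.8, 0.9]
labels, centroids = kmeans_1d(scores, centroids=[-1.0, 0.0, 1.0])
```

With these toy inputs the responses separate into negative, neutral, and positive groups. Results on real data depend on the initialization and on the choice of k, which would normally be justified separately.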

Literature mining

A total of 10 studies described the use of AI for mining COVID-19 literature (Supplementary Table S1). Literature mining studies on drug repurposing were summarized in a previous section. These 10 studies focused on summarizing topics and trends in COVID-19 research and identifying future research needs. All but 2 studies mined either PubMed or the COVID-19 Open Research Dataset. Of the other 2 studies, 1 mined ClinicalTrials.gov to extract data on COVID-19-related trials, while the other searched the Scopus database for a bibliometric analysis. All of the studies involved NLP methods and tools (eg, word2vec, doc2vec). Some studies performed topic modeling and/or sentiment analysis. The only study that performed heterogeneous data integration was Reese et al (Table 4), in which data from 13 heterogeneous knowledge sources (eg, scientific literature, COVID-19 cases, drug, genome sequences, chemicals, etc) were downloaded, transformed, and integrated to create the KG-COVID-19 knowledge graph.
Table 4.

Other COVID-19 studies using heterogeneous data

| Study | Region | Outcome | Data source | Model | Heterogeneous data (a) | Missing data imputation |
| --- | --- | --- | --- | --- | --- | --- |
| Literature mining | | | | | | |
| Reese et al [128] | N/A | Knowledge Graphs for COVID-19 Response | 13 knowledge sources | Traditional or graph-based ML | Scientific literature, COVID-19 cases and mortality, drug, genome sequence, diseases, chemicals | N/A |
| Surveillance | | | | | | |
| Franchini et al [129] | Italy | Individualized COVID-19 risk | Survey, medical records | RF, SVM, GBM | Demographic, health status, other health and social information | No |
| Miscellaneous topics | | | | | | |
| Abdalla et al [130] | USA | Social distancing | NYT, Census Bureau, USDA ERS, CDC, Google Community Mobility Reports | Elastic net | 43 socio-demographic variables | No |

(a) Data that are heterogeneous in syntax, schema, and semantics.

CDC: Centers for Disease Control and Prevention; GBM: gradient boosting machine; ML: machine learning; NYT: New York Times; RF: random forest; SVM: support vector machine; USDA ERS: US Department of Agriculture Economic Research Service.


Surveillance

A total of 6 studies described the use of AI for social distancing or syndromic surveillance (Supplementary Table S1). Three of these studies analyzed data from surveillance cameras for monitoring social distancing using well-known deep learning models for object detection, including the single-shot detector, YOLO (you only look once), and/or the regional CNN detector. Two other studies focused, respectively, on analyzing Bluetooth signal strength data with linear and logistic models for contact tracing and on developing an NLP and deep learning-based pipeline for sentinel syndromic surveillance of COVID-19 using medical records. The remaining study developed a Telegram Bot that could model individualized COVID-19 risk by integrating heterogenous data, including user responses and health/social data in medical records (Table 4). This lone study involving heterogenous data used the machine learning models random forest, SVM, and GBM.

Clinical trials

Two studies described the use of AI models in noninterventional clinical trials on COVID-19 patients (Supplementary Table S1). The 2 trials, namely READY (NCT04390516) and IDENTIFY (NCT04423991), were conducted by the same group of investigators based on the same machine learning algorithm (an XGBoost classifier) designed to predict mechanical ventilation and mortality within 24 hours upon hospital admission using inputs from clinical data. The READY trial evaluated the performance of the algorithm, while the IDENTIFY trial identified a subpopulation of COVID-19 patients who had improved survival from taking hydroxychloroquine. Neither study integrated heterogenous data for modeling.

Miscellaneous topics

A total of 6 studies did not fall under any of the previous research topics (Supplementary Table S1). In the lone study that integrated heterogeneous data for modeling, Abdalla et al integrated 43 sociodemographic variables from multiple sources (eg, Census Bureau, US Department of Agriculture, Centers for Disease Control and Prevention) and built elastic net models to examine how sociodemographics impacted county-level social distancing (Table 4). Of the remaining studies, 1 used ANN to perform a drive-through mass vaccination simulation, while the other 4 used NLP methods and tools on various research topics: cross-lingual clinical deidentification in electronic health records (EHRs), dream report analysis, drug safety analysis by mining the FDA adverse event system, and COVID-19 clinical concept (signs and symptoms) identification and normalization in EHRs.
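
As a rough illustration of what integrating variables from multiple sources at the county level can look like, the Python sketch below joins three small tables on a shared county identifier. The tables, identifiers, feature names, and values are all invented for illustration and are not taken from Abdalla et al; the `integrate` helper is likewise hypothetical.

```python
# Hypothetical county-level extracts, keyed by a made-up county identifier.
census = {
    "A01": {"median_income": 52000, "pct_over_65": 14.1},
    "A02": {"median_income": 51000, "pct_over_65": 16.5},
}
usda = {
    "A01": {"rural_urban_code": 2},
    "A02": {"rural_urban_code": 1},
    "A03": {"rural_urban_code": 1},  # present in only one source
}
mobility = {
    "A01": {"retail_change_pct": -31.0},
    "A02": {"retail_change_pct": -44.0},
}

def integrate(sources):
    """Inner-join several {key: {feature: value}} tables on their shared keys."""
    shared = set.intersection(*(set(table) for table in sources.values()))
    merged = {}
    for key in sorted(shared):
        row = {}
        for name, table in sources.items():
            for feature, value in table[key].items():
                row[f"{name}.{feature}"] = value  # prefix avoids column-name clashes
        merged[key] = row
    return merged

table = integrate({"census": census, "usda": usda, "mobility": mobility})
```

The inner join silently drops any county missing from one source ("A03" here). Whether to drop such rows or keep them and impute the gaps is a modeling decision that an integration framework would need to make explicit.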

DISCUSSION

As governments, research communities, and healthcare industries actively work to address the COVID-19 pandemic, we are tasked with identifying quick yet reliable solutions for screening, diagnosis, forecasting, surveillance, the development of vaccines or drugs, and so on. At the same time, with large amounts of COVID-19-related data being collected in novel surveillance systems, AI methods have been widely employed to assist medical experts and researchers in addressing COVID-19 challenges. In this article, we reviewed 1338 recent studies that applied AI methods or technologies in COVID-19 research. In the 794 studies included in our final qualitative analysis, we identified 7 key areas in which AI was applied. We also found that a wide range of machine learning and deep learning algorithms were used for modeling, although some were used more frequently than others depending on the area of research.

It is not at all surprising that AI methods have been used extensively in many areas of COVID-19 research. AI has been revolutionary for many analytics challenges in medicine and public health. For example, just shy of half of the studies we reviewed were studies of medical imaging analysis for assisting COVID-19 diagnosis. In fact, the use of AI in diagnostic medical imaging has been extensively explored for many diseases, such as cancer, cardiovascular diseases, lung diseases, and brain diseases. In these applications, AI has shown impressive sensitivity, similar to or better than expert interpretation, in identifying patterns and abnormalities in medical images that can aid diagnosis. Another major AI application in COVID-19 research is disease forecasting, with one-fifth of the studies we reviewed being in this category. Compared to popular statistical time series models such as the ARIMA, AI models such as the LSTM have been shown to have superior precision and accuracy when predicting time series data, without making explicit assumptions (eg, stationarity) about the data. In several other areas of COVID-19 research, AI methods are the preferred data analysis tools because of their ability to handle large amounts of heterogenous data, including text data such as those in clinical narratives or on social media. For example, in drug discovery and genomic research, AI is ideal for analyzing massive amounts of sequence data (eg, proteomic or genomic data).

One limitation of the AI applications included in our scoping review is the lack of integration of data from heterogenous sources for modeling. In the era of precision health, it is critical to examine a comprehensive list of determinants of COVID-19 outcomes, including biological, clinical, social, behavioral, and environmental factors, that exist in various heterogeneous data sources. However, most studies we reviewed used data from a single source to perform the AI-driven tasks. For instance, over 90% of the imaging studies included in this review used data from radiological images only to build AI models for COVID-19 diagnosis. This single-sourced approach ignores other important risk factors such as clinical symptoms, exposure history, and lab test results, leading to algorithms with bias (eg, confounding bias) and suboptimal performance. In fact, many of the medical imaging studies that integrated heterogenous data have shown that data integration led to AI models with better performance compared to models built with imaging data alone. Furthermore, although some data are difficult to obtain because of privacy issues or are simply unavailable, there is still a range of public data on risk factors that could be easily obtained for modeling. Many studies we reviewed leveraged the “free” data sources, such as the huge amounts of environmental data from the National Oceanic and Atmospheric Administration or the socioeconomic data from the Census Bureau. Overall, integrating heterogenous but relevant data for modeling will help realize the full potential of AI algorithms, and thus improve precision and reduce bias. Our review highlights the need for a multilevel AI framework that supports the analysis of heterogenous data from different sources.

Our scoping review has several limitations. First, our search strategy is not as comprehensive as that of a systematic review. For example, our keyword list did not include “AI.” Articles that used the abbreviation “AI” without mentioning “artificial intelligence” were not included in this review. Although we do not expect a large number of articles to have been omitted, we do acknowledge this limitation in keywords. Second, we searched 2 major COVID-19 literature databases rather than the traditional databases used in systematic literature reviews. Relevant articles were often indexed in these 2 COVID-19 databases with a delay of a few days up to months. Third, we did not perform a risk of bias assessment given that this is a scoping review.

CONCLUSION

Huge amounts of novel data related to COVID-19 have emerged quickly during the pandemic. As a result, AI methods and technologies have been widely applied in efforts to overcome COVID-19 challenges. In this scoping review (date of literature search: March 9, 2021), we show that a broad range of AI algorithms are used for COVID-19 research, and these algorithms are primarily used in 7 major research areas. We also show that there is a lack of data integration in these AI applications and a need for a multilevel AI framework that supports the analysis of heterogenous data from different sources.

FUNDING

Drs Guo and Bian were funded in part by the National Institutes of Health (NIH) (Award number: R01 CA246418, R21 CA245858, R21 AG068717, R21 CA253394) and Centers for Disease Control and Prevention (Award number: U18 DP006512).

AUTHOR CONTRIBUTIONS

JB and YG conceived the project. YZ and TL performed the literature search and article screening, with YG being the third reviewer. YZ and TL performed the information extraction and created the initial tables. YG drafted the manuscript. MP, FW, HX, and JB assisted in writing. All authors read and approved the manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.