Bahrad A Sokhansanj, Gail L Rosen.
Abstract
Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a powerful way to build complex sequence-to-phenotype models. Unfortunately, while it can be predictive, deep learning typically produces "black box" models that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.
Keywords: COVID-19; SARS-CoV-2; bioinformatics; deep learning; explainable AI; genomics; machine learning; viral genomics
Year: 2022 PMID: 35311562 PMCID: PMC9040592 DOI: 10.1128/msystems.00035-22
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 7.324
FIG 1. Total number of sequences submitted to GISAID (4) (left axis and bars, blue) and ratio of the number of submitted sequences to the total number of reported cases (104) as of 7 January 2022 for the 30 countries that have submitted the most sequences to GISAID. The overwhelming majority of sequences in GISAID come from North America and Europe. Over half of all sequences are from the United States and United Kingdom alone. Sequencing rates show even greater disparities.
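
The sequencing rate in the caption above is a simple ratio: submitted sequences divided by reported cases. The following is a minimal sketch of that arithmetic, not the authors' code. The function name `sequencing_rate` and the country labels and counts are hypothetical placeholders, not real GISAID or case data.

```python
def sequencing_rate(sequences_submitted: int, reported_cases: int) -> float:
    """Return submitted sequences as a fraction of reported cases."""
    if reported_cases <= 0:
        raise ValueError("reported_cases must be positive")
    return sequences_submitted / reported_cases


# Hypothetical placeholder counts (sequences, reported cases).
# These are NOT real GISAID submissions or case numbers.
example_counts = {
    "Country A": (50_000, 1_000_000),
    "Country B": (2_000, 4_000_000),
}

rates = {
    name: sequencing_rate(sequences, cases)
    for name, (sequences, cases) in example_counts.items()
}

# List countries from highest to lowest sequencing rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.4%}")
```

Ranking by rate instead of raw sequence count is what exposes the second kind of disparity the caption mentions: a country can report more cases and still sequence a far smaller share of them.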