
Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences.

Abstract

Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a powerful way to build complex sequence-to-phenotype models. Unfortunately, while it can be predictive, deep learning typically produces "black box" models that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.


Keywords: COVID-19; SARS-CoV-2; bioinformatics; deep learning; explainable AI; genomics; machine learning; viral genomics

Year: 2022 PMID: 35311562 PMCID: PMC9040592 DOI: 10.1128/msystems.00035-22

Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 7.324

PERSPECTIVE

COVID-19 has been called the “first pandemic in the post-genomic era” (1). The first severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome was published on 12 January 2020, a week after the WHO first reported on the virus. Only 5 days later, the sequence was used to design the mRNA vaccines that have changed the course of the pandemic (2). Since then, next-generation sequencing technology has enabled an unprecedented view of genetic changes in the virus throughout both the duration of the pandemic and different parts of the world (1, 3). Global sharing of sequence data has been equally critical, much to the credit of the GISAID EpiCoV database project (4). GISAID’s primary mission has been to share flu genomes, in part to help design the annual flu vaccine (its full name being the Global Initiative on Sharing All Influenza Data) (5). Now, at the beginning of 2022, the GISAID EpiCoV database has accumulated nearly 7 million SARS-CoV-2 genome sequences, and at present, around 800,000 sequences are being added each month. So much data has been generated and made available that it has spurred the development of computational tools for high-frequency sequence variant tracking (6) and even daily updates (7, 8). Despite the surge in research efforts devoted to COVID-19 (9, 10), laboratory study of the virus remains a more specialized and time-consuming effort than sequencing. Clinical and epidemiological data are often superficial, measuring only a few variables, except for data sets specific to particular facilities or narrow populations. We need to fully capitalize on the abundant data that we do have to (i) anticipate how changes in the virus might affect health before we have time to gather empirical data and (ii) better design and interpret experiments to maximize our use of limited resources. So, how can we translate genome sequence data to as much biological understanding and actionable clinical insight as possible?

SARS-COV-2 IS RAPIDLY EVOLVING

SARS-CoV-2 has spent the first 2 years of the pandemic rapidly evolving in ways that have had a big impact on virulence, transmission, and ability to evade our immune responses (11). SARS-CoV-2 is an RNA virus, so its genome is prone to mutate, albeit at a rate mitigated by its large genome size and the proofreading function of its exoribonuclease (12). The most frequent mutations observed in coronaviruses are generally substitutions, although insertions and deletions are observed as well (13). In some cases, insertions from other viral genomes may occur, and, in fact, it appears as though the SARS-CoV-2 genome includes an insertion from human RNA (14). In other human coronaviruses, the estimated mutation rate is around 3 × 10⁻⁴ substitutions per site per year (15, 16). The amount of mutation observed during the COVID-19 pandemic has been even more substantial than expected (17). An early estimate of SARS-CoV-2 mutation was 6 × 10⁻⁴ substitutions per site per year (18). But the disease has spread widely around the world since then, and novel variants transmit more quickly, increasing the opportunities for the virus to mutate (19, 20). The SARS-CoV-2 spike protein will continue to change in the future. Studies on another human coronavirus, HCoV-OC43, suggest that genetic drift plays a role in coronavirus adaptive evolution (21). One study estimates that as of July 2021, SARS-CoV-2 had only “explored” 31% of the potential space for spike gene variation, based on comparisons with related sarbecoviruses (22).
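To put the per-site rates quoted above on a per-genome scale, the following minimal sketch multiplies each rate by a genome length. The length of 29,903 nucleotides (the Wuhan-Hu-1 reference, NC_045512.2) is an assumption added here for illustration and is not stated in the text; the function and variable names are likewise hypothetical.

```python
# Assumption: length of the Wuhan-Hu-1 reference genome (NC_045512.2), in nucleotides.
GENOME_LENGTH_NT = 29_903


def expected_substitutions_per_year(rate_per_site_per_year: float,
                                    genome_length: int = GENOME_LENGTH_NT) -> float:
    """Convert a per-site, per-year substitution rate into a per-genome, per-year count."""
    return rate_per_site_per_year * genome_length


# Rates quoted in the text (substitutions per site per year).
other_hcov = expected_substitutions_per_year(3e-4)   # other human coronaviruses
early_sars2 = expected_substitutions_per_year(6e-4)  # early SARS-CoV-2 estimate

print(f"Other human coronaviruses: about {other_hcov:.1f} substitutions per genome per year")
print(f"Early SARS-CoV-2 estimate: about {early_sars2:.1f} substitutions per genome per year")
```

Under these assumptions the two rates correspond to roughly 9 and 18 substitutions per genome per year, respectively. This is only a back-of-the-envelope conversion; it ignores site-to-site rate variation and selection.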

SEQUENCE ANALYSIS HAS STRUGGLED TO KEEP UP

The first widely used tool for tracking SARS-CoV-2 genomic variation was the Nextstrain project, https://nextstrain.org. Nextstrain, originally developed as a general tool for viruses, was adapted to offer clade definitions for SARS-CoV-2 based on phylogenetic analysis (23). Phylogenetic tree reconstruction has been effective in inferring viral origins and tracing transmission changes but not as useful in classifying genomes because the virus can accumulate and drop mutations in parallel across clades and subclades (24). The Pango nomenclature (https://cov-lineages.org/), developed specifically for SARS-CoV-2, has largely supplanted Nextstrain clade definitions (25). New sequences are assigned to Pango classifications, called “lineages,” using the Random Forests classification algorithm. A new Pango lineage is defined when a sufficient number of viral sequences emerges with a phylogenetic dissimilarity from existing sequences above a set threshold (26). Particularly significant Pango lineages have been identified by the World Health Organization (WHO) as variants of concern (VOC), which are given Greek letter designations (27), such as Alpha (Pango lineage B.1.1.7), Beta (B.1.351), Delta (B.1.617.2), and, recently, Omicron (B.1.1.529). While Pango lineages appear clear and well-defined, the reality is that the genome is much more fluid. If we want to understand how the genome affects viral function, we cannot rely on traditional taxonomic categorization. As mutations recur, revert, and proliferate, taxonomy hits its limits of utility (11). As an initial matter, changes to SARS-CoV-2 properties often implicate combinations of multiple mutations that emerge simultaneously and then sometimes revert in whole or in part as the virus continues to evolve (28, 29). For example, one frequent spike protein amino acid substitution, N501Y, has appeared and reverted contemporaneously in multiple clades and lineages, with no evidence of recombination (30).
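Substitution labels such as N501Y encode a reference residue, a 1-based position, and a variant residue. The sketch below shows one way such a label could be parsed and checked against a protein sequence. It assumes the sequence is ungapped and uses the same numbering as the reference; the function names and the toy sequences are hypothetical and are not drawn from any tool cited here.

```python
import re


def parse_substitution(label: str) -> tuple[str, int, str]:
    """Split a label like 'N501Y' into (reference residue, 1-based position, variant residue)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"Not a simple substitution label: {label!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def has_substitution(protein_seq: str, label: str) -> bool:
    """Return True if the sequence carries the variant residue at the labeled position.

    Assumes protein_seq is ungapped and numbered like the reference protein.
    """
    _ref, pos, alt = parse_substitution(label)
    return len(protein_seq) >= pos and protein_seq[pos - 1] == alt


# Toy sequences (not real spike protein): position 3 is N in one and Y in the other.
toy_reference = "MKNV"
toy_variant = "MKYV"
print(parse_substitution("N501Y"))
print(has_substitution(toy_variant, "N3Y"), has_substitution(toy_reference, "N3Y"))
```

Because a substitution can appear and revert independently in several lineages, a per-sequence check like this tracks the mutation itself rather than relying on lineage membership.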
Simultaneous mutations can also have unpredictable, nonlinear effects, i.e., they can be synergistic, antagonistic, or fully independent (31). This complicates classical and Bayesian logistic regression methods for predicting fitness or protein function from mutations, as they rely on assuming the independence between mutations of individual amino acids or bases (32). SARS-CoV-2 evolution is also highly nonlinear. Widespread lineages, such as Delta, have spawned complex sublineages with distinct immune evasion and virulence properties, which often genetically share more in common with distantly related lineages than their most recent ancestor (33, 34). The increasingly complex evolutionary history of the virus stymies other proposed methods for genetically subtyping viral variants as well (35–37). Further complicating the picture, some immunocompromised individuals can have chronic infections lasting 6 months to a year (38). During long-term infection, a spike protein can emerge with multiple variations, which phylogenetic analysis identifies as “long branch” divergence from the phylogenetic tree (39). Some long-term patients may even be treated with convalescent plasma or antibodies, which may select for immune evasive mutations (40). The Omicron variant has such a long branch divergence, indicating that it may have emerged in an immunocompromised host or after incubating in a nonhuman host such as mice (41, 42).
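The nonlinear interactions between simultaneous mutations described above can be illustrated with a toy calculation. This sketch compares the measured effect of a double mutant with the sum of the two single-mutant effects. The sign convention (synergistic when the deviation has the same sign as the additive expectation) is one common choice and an assumption here, and all numbers are hypothetical rather than measured values.

```python
def classify_interaction(effect_a: float, effect_b: float, effect_ab: float,
                         tol: float = 1e-9) -> tuple[str, float]:
    """Classify a pairwise interaction from effects measured relative to wild type.

    effect_a, effect_b: effects of each single mutation (e.g., change in log fitness).
    effect_ab: effect of the double mutant.
    Returns (label, epsilon), where epsilon is the deviation from additivity.
    """
    additive = effect_a + effect_b
    epsilon = effect_ab - additive
    if abs(epsilon) <= tol:
        return "independent", epsilon
    # Assumed convention: deviation in the same direction as the additive effect amplifies it.
    if epsilon * additive > 0:
        return "synergistic", epsilon
    return "antagonistic", epsilon


# Hypothetical effects for two mutations and three possible double-mutant outcomes.
for double_effect in (0.5, 0.3, 0.1):
    label, eps = classify_interaction(0.1, 0.2, double_effect)
    print(f"double-mutant effect {double_effect}: {label} (epsilon = {eps:+.2f})")
```

A model that assumes independence between sites would predict the additive value in every case, which is why nonzero deviations of this kind complicate regression methods built on that assumption.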

CAN DEEP SEQUENCE LEARNING HELP?

How can we predict the virulence, fitness, antibody evasion, and other key properties of novel SARS-CoV-2 variants from complex, nonlinear changes in genetic sequence? Machine learning can tackle complex pattern recognition problems by training a model that can classify organisms or genes by phylogeny or phenotype based on features of their genetic sequences. For example, we can extract k-mer (short subsequence) frequencies or other combinations of bases/amino acids and use them as features to train classifiers such as naive Bayes classifiers (NBC), support vector machines (SVM), decision tree-based methods, and neural networks (43–49). Machine learning with k-mer features has been used for SARS-CoV-2 to identify genetic fingerprints of specific infections (50), classify variants (51, 52), and train a model to predict the pathogenicity of unknown viruses (53). Another approach is to build profile hidden Markov models (HMMs), which can identify taxonomic lineages and variants of viruses. HMMs have been used to align SARS-CoV-2 sequences and compare its spike protein to that of other coronaviruses (22, 54, 55). Deep learning has emerged as an even more powerful and flexible tool to find patterns in large and complicated data sets (56–59). Deep learning models use multiple layers of neural networks to automatically extract and transform features during training (56–58). We can borrow deep learning methods developed for natural language processing (NLP) to find patterns in sequence data, where the relationships between the bases and amino acids that make up genome and protein sequences are analogous to the semantic relationships between the words that make up sentences (60–63). For example, one group of researchers has used concepts from semantic processing, e.g., the frequency of correlated words, to identify potential mutagenic sites in viruses including SARS-CoV-2 (64).
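As an illustration of the k-mer features mentioned above, the following minimal sketch turns a nucleotide sequence into a fixed-length vector of k-mer frequencies that could be passed to a classifier. The function name, the default k = 3, and the choice to skip k-mers containing ambiguous bases are assumptions made for this example, not the procedure of any cited study.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 3, alphabet: str = "ACGT") -> list[float]:
    """Return normalized k-mer frequencies in a fixed order (lexicographic over the alphabet).

    K-mers containing characters outside the alphabet (e.g., 'N') are skipped.
    """
    seq = seq.upper()
    counts: Counter[str] = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in alphabet for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Toy sequence, not a real genome fragment.
features = kmer_frequencies("ACGTACGT", k=2)
print(len(features))  # one entry per possible 2-mer over ACGT
print(features[1])    # frequency of "AC", the second 2-mer in lexicographic order
```

Because the vector length depends only on k and the alphabet, sequences of different lengths map to the same feature space, which is what makes k-mer counts convenient as classifier input.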
An emerging approach to deep sequence learning is to transform protein sequences to embeddings that reflect their semantic structure, using the BERT (bidirectional encoder representations from transformers) neural network architecture, which Google developed to handle natural language search (65–68). An example of this approach is k-means clustering of “ProtBERT” SARS-CoV-2 protein embeddings generated by pretraining a BERT model on millions of UniProt sequences, which can be used to identify mutational hot spots within the genome that may give rise to future variants (69). A key goal for modeling is to predict the health risk of emerging variants before empirical data are available. To this end, our group has developed a deep learning model to predict patient outcomes for emerging sequence variants that takes into account patient demographics (70). Others are working to integrate sequence learning with computational protein structure models. For example, one project combines models of cell receptor binding and immune epitope alteration with transformer-based deep learning models to predict the fitness advantage of mutations (71). Deep learning has also been used to identify the relationship between protein sequence and function using data from deep mutational scanning, an experimental technique for massively parallel functional analysis of protein sequence site mutations (72, 73). Using this approach, another project predicts the risk for emerging variants by using a neural network to predict infectivity and vaccine breakthrough in combination with protein structure and binding prediction to model antibody resistance (74).

LOOKING INSIDE THE DEEP LEARNING BLACK BOX

Deep learning methods excel at identifying complex features within data that allow classification. But they have a major weakness. Deep learning relies on neural networks, and it is very hard to determine why a neural network makes a particular classification or prediction. Interpretable, or explainable, machine learning can fill this important gap (75, 76). Interpretable machine learning is particularly important in bioinformatics, since explaining a model’s predictions is critical to justify making high-stakes clinical or research decisions based on machine learning predictions (77, 78). Accordingly, developers of deep learning approaches to SARS-CoV-2 should consider providing some functionality to interpret or explain predictions. Analytical tools for interpretability in deep learning include examining neural network structure through relevance propagation, activation difference propagation, sensitivity analysis, and saliency map methods (79–81). Integrated gradients have been used to analyze RNA splicing models (82). An increasingly popular approach is the “attention” mechanism originally developed for NLP (83, 84). Attention can highlight important features in text processed by deep learning models (85–87). The amount of “attention” at a position in a sequence correlates with the weight put on that position in a trained model, where high attention at a position implies potential significance. Architectures combining convolutional neural networks (CNNs) with attention have been used to identify sequence motifs for functional genomics, e.g., transcription factor binding site detection (88, 89). Another group generated predictive models of adverse drug reactions based on chemical structures by combining attention with a CNN for each chemical property and structural feature in the model (90). 
Our group has shown that attention in combination with a recurrent neural network-based sequence model can provide insight into taxonomic and phenotypic classification of microbial 16S rRNA sequences (91), as well as gene ontology classifications of protein sequences (92). Recently, transformer-based architectures have emerged, like the aforementioned BERT (93). Transformers are built on multiple attention modules (“heads”), which could be used for interpretability (94). For example, one recent paper demonstrated how different attention heads attended to different aspects of a learning task to identify nucleotide motifs for promoter sequences (95). However, attention cannot be inherently drawn out of transformers. Further processing steps are generally required to connect attention to specific linguistic features (96). Our group recently applied a self-attention layer after a transformer as a way to more readily extract and visualize attention across the sequence and applied it to SARS-CoV-2 (70). An important caveat is that, based on comparing attention to empirical evidence, attention does not necessarily imply explanation, at least in the sense of explaining precisely why a prediction took place (97). Attention can only highlight features that the attention layer of the deep learning model weighted most heavily during training, so it may only weakly indicate the complete set of important features for a classification problem.
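To make the idea of per-position attention concrete, the sketch below normalizes a list of raw attention scores with a softmax and reports the positions that receive the most weight. It is a generic illustration, not the architecture of any cited model; the scores are hypothetical stand-ins for what a trained attention layer would produce.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Normalize raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_attended_positions(scores: list[float], top_n: int = 3) -> list[tuple[int, float]]:
    """Return (0-based position, attention weight) pairs for the top_n highest weights."""
    weights = softmax(scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i, weights[i]) for i in ranked[:top_n]]


# Hypothetical raw scores for a five-residue sequence.
raw_scores = [0.1, 2.0, 0.3, 1.5, 0.0]
for position, weight in top_attended_positions(raw_scores, top_n=2):
    print(f"position {position}: weight {weight:.3f}")
```

As the caveat above notes, a high weight marks a position the model relied on; it does not by itself establish why the prediction was made.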

SEQUENCING DISPARITIES AND DATA CHALLENGES

Finally, we highlight three important data limitations that researchers should address. First, as Fig. 1 shows, there are serious global inequities in sequencing data, with the overwhelming majority of sequences coming from Europe and North America. GISAID has encouraged data sharing from developing countries by trading restrictions on republishing sequence information for access to that information (98). But global sequencing resources are disparately available (99). Even within Europe and the United States, racial and regional disparities in sequencing found in other surveys (100) hamper SARS-CoV-2 sequencing as well. Second, the task of interpreting sequencing data is complicated by insufficient sample metadata, making it difficult to understand how SARS-CoV-2 sequences affect patient outcomes, for example. In GISAID, most sequences only have information about a patient’s age or gender (if available) and the location where the sample was collected. As of 7 January 2022, a little over 270,000 sequences (4%) of the nearly 6.9 million have any metadata for patient outcomes, and many metadata entries are unintelligible. Sequencing projects should be encouraged to collect and curate as much information as possible about the sample and meet minimum information standards for sequence metadata (101). Third, sequencing errors can lead to spurious results. Quality control is critical to make sure that low-frequency sequence variants are real (102). Sequences can pick up contaminants from other variants in the amplification process, leading to what appear to be recombinant variants but which are in fact simply artifacts (103).
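As a rough illustration of the quality-control point above, the sketch below screens sequences by length and by the fraction of ambiguous bases before any downstream analysis. The thresholds (at most 1% ambiguous bases, at least 29,000 nucleotides) and the function names are assumptions chosen for this example, not a published standard or the criteria of any database.

```python
def ambiguous_fraction(seq: str) -> float:
    """Fraction of characters that are not unambiguous bases (A, C, G, or T)."""
    seq = seq.upper()
    if not seq:
        return 1.0  # treat an empty sequence as fully ambiguous
    return sum(1 for base in seq if base not in "ACGT") / len(seq)


def passes_qc(seq: str, max_ambiguous: float = 0.01, min_length: int = 29_000) -> bool:
    """Apply two simple screens; both thresholds are illustrative assumptions."""
    return len(seq) >= min_length and ambiguous_fraction(seq) <= max_ambiguous


# Toy sequences, with min_length lowered so the short examples can pass.
clean = "ACGT" * 10
noisy = "ACGN" * 10
print(passes_qc(clean, min_length=10), passes_qc(noisy, min_length=10))
```

A screen like this catches only gross problems such as low coverage. Artifacts from contamination during amplification can yield sequences with no ambiguous bases at all, so they require separate checks.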

FIG 1

Total number of sequences submitted to GISAID (4) (left axis and bars, blue) and ratio of the number of submitted sequences to the total number of reported cases (104) as of 7 January 2022 for the 30 countries that have submitted the most sequences to GISAID. The overwhelming majority of sequences in GISAID come from North America and Europe. Over half of all sequences are from the United States and United Kingdom alone. Sequencing rates show even greater disparities.
