
Sequencing meets machine learning to fight emerging pathogens: A preview.

Artur Yakimovich

Abstract

In searching for SARS-CoV-2 variants-of-concern, pathogen sequencing is generating an impressive amount of data. However, beyond epidemiological use, these data contain cues fundamental to our understanding of pathogen evolution in the human population. Yet, to harness them, further development of computational methodology, such as machine learning, may be required. This preview discusses recent developments in machine learning for understanding emerging pathogens.
© 2022 The Author.

Year:  2022        PMID: 35169757      PMCID: PMC8832723          DOI: 10.1016/j.patter.2022.100448

Source DB:  PubMed          Journal:  Patterns (N Y)        ISSN: 2666-3899


Main text

With over 300 million confirmed cases to date, SARS-CoV-2 demonstrates the sheer extent to which a pandemic pathogen can transform our interconnected world. However, unlike in many pandemics of the past, the availability of sequencing techniques has profoundly changed the amount of quasi-real-time information we have about the unfolding situation. The monumental sequencing effort undertaken by the scientific community has already led to the accumulation of several million SARS-CoV-2 sequences, offering an unprecedented research opportunity. However, to tap into this opportunity, the biomedical community is in dire need of a qualitatively new set of tools. Indeed, picking up faint but pivotal patterns within several million SARS-CoV-2 sequences at high speed is an insurmountable task even for a large team of scientists. Instead, the tools required for such tasks should be capable of sifting through millions of sequences almost instantaneously. At the same time, these tools must be sensitive and specific enough to uncover yet unknown dependencies without generating false leads. These are exactly the capabilities machine learning (ML) has in store.

While ML techniques have been used in biology before, recent advances in computational algorithms and hardware have rekindled the popularity of ML within the biomedical field, particularly within infection biology (reviewed in Yakimovich, 2021). Adding to the ML toolset, Park and colleagues recently proposed a novel ML approach to identify discriminative genomic features in SARS-CoV-2 in a metaviromic fashion. In this approach, the authors first collated a dataset consisting of 3,665 human and animal coronavirus genomes, including SARS-CoV-2, MERS-CoV, and SARS-CoV. Furthermore, by comparing coronaviruses historically associated with higher case fatality rates, combined with multiple sequence alignment and knowledge of genomic regions of interest, Park et al. developed coronavirus pathogenicity (COPA) scores for every nucleotide in the SARS-CoV-2 genome. To achieve this, the authors first compared a set of well-established ML algorithms, including random forest, support vector machines, Bernoulli naive Bayes, gradient boosting, and multi-layer perceptron, to determine their ability to identify genomes associated with pathogenic coronaviruses. Next, Park and colleagues integrated these approaches into a statistical metamodel, allowing them to search for pathogenic hotspots within the SARS-CoV-2 genome.

Employing this approach, the authors generated 2,473 discriminative hotspots across the SARS-CoV-2 genome and compared them to the known genomic regions of interest. Remarkably, they demonstrate that the hotspots associated with the SARS-CoV-2 spike (S) protein overlap with both the infamous furin cleavage site and the contact sites with angiotensin-converting enzyme 2 of the host cell. Another interesting hotspot identified by Park et al. corresponded to amino acid insertions that allow betacoronaviruses to be differentiated from alpha- and gammacoronaviruses. Furthermore, the authors combined the knowledge of B and T immune-cell epitopes, which are responsible for generating the immune response, with the ML-generated hotspot map of the SARS-CoV-2 genome. They noticed that the high-COPA-score regions of the S and N proteins significantly overlapped with potential B cell epitopes. Finally, the authors looked at the cross-correlation between COPA hotspots and the sequences of the known variants-of-concern. Among other observations, they noticed an overlap of several high-score residues with mutations of variant B.1.1.7, also known as the SARS-CoV-2 Alpha variant. Following these observations, Park and coworkers suggested that their methodology could be useful for guiding the design of future SARS-CoV-2 vaccines and for epidemiological variant surveillance.
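To make the idea of per-position discriminative scoring more concrete, the following is a minimal sketch and not the COPA method of Park et al. It assumes a toy set of already aligned sequences with made-up binary labels (1 for a hypothetical high-pathogenicity group, 0 otherwise), fits only a Bernoulli naive Bayes model (one of the algorithm families listed above) on one-hot encoded nucleotides, and scores each alignment column by how strongly its nucleotide frequencies differ between the two groups. All function names, the data, and the threshold are illustrative assumptions, and no metamodel is built.

```python
import math

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode an aligned sequence as binary (position, nucleotide) features."""
    return [1 if base == nt else 0 for base in seq for nt in ALPHABET]


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(feature = 1 | class) for classes 0 and 1 with Laplace smoothing."""
    probs = {}
    for cls in (0, 1):
        rows = [x for x, label in zip(X, y) if label == cls]
        n = len(rows)
        probs[cls] = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
    return probs


def position_scores(probs, seq_len):
    """Score each position by its largest absolute log-ratio between the classes."""
    scores = []
    for pos in range(seq_len):
        feats = range(pos * len(ALPHABET), (pos + 1) * len(ALPHABET))
        scores.append(max(abs(math.log(probs[1][j] / probs[0][j])) for j in feats))
    return scores


# Toy aligned sequences (hypothetical data, not real coronavirus genomes).
sequences = ["ACGTAC", "ACGTAC", "ACGTTC", "ACCTAG", "ACCTAC", "ACCTTG"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = hypothetical high-pathogenicity group

X = [one_hot(s) for s in sequences]
probs = fit_bernoulli_nb(X, labels)
scores = position_scores(probs, len(sequences[0]))

threshold = 1.2  # arbitrary cutoff chosen for this toy example
hotspots = [i for i, s in enumerate(scores) if s >= threshold]
print(hotspots)
```

In this toy alignment, position 2 (0-based) is the only column that separates the two groups perfectly, so it is the only position that clears the cutoff. A real analysis would need a proper multiple sequence alignment, held-out evaluation, and handling of gaps and ambiguous bases.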
Together with other efforts applying novel ML methodology to pathogen genomics, the work of Park et al. outlines the immense potential ML offers for understanding and perhaps controlling future outbreaks. Needless to say, despite the immense sequencing effort undertaken, our coverage of the world of pathogens remains a drop in the ocean. The lion's share of SARS-CoV-2 surveillance sequencing is performed by precious few nations and laboratories, making our datasets shortsighted at best. Yet, should the current trends in pathogen sequence data collection continue, perhaps in a future powered by a handful of big-data resources like the Global Initiative on Sharing All Influenza Data (GISAID), ML may drive a new era of infection biology and change our approach to emerging pathogens from reactive to proactive.
  10 in total

Review 1.  Setting the standards for machine learning in biology.

Authors:  David T Jones
Journal:  Nat Rev Mol Cell Biol       Date:  2019-11       Impact factor: 94.444

Review 2.  Machine learning applications in genetics and genomics.

Authors:  Maxwell W Libbrecht; William Stafford Noble
Journal:  Nat Rev Genet       Date:  2015-05-07       Impact factor: 53.242

3.  GISAID: Global initiative on sharing all influenza data - from vision to reality.

Authors:  Yuelong Shu; John McCauley
Journal:  Euro Surveill       Date:  2017-03-30

4.  PACIFIC: a lightweight deep-learning classifier of SARS-CoV-2 and co-infecting RNA viruses.

Authors:  Pablo Acera Mateos; Renzo F Balboa; Simon Easteal; Eduardo Eyras; Hardip R Patel
Journal:  Sci Rep       Date:  2021-02-05       Impact factor: 4.379

5.  COVID-DeepPredictor: Recurrent Neural Network to Predict SARS-CoV-2 and Other Pathogenic Viruses.

Authors:  Indrajit Saha; Nimisha Ghosh; Debasree Maity; Arjit Seal; Dariusz Plewczynski
Journal:  Front Genet       Date:  2021-02-11       Impact factor: 4.599

6.  Metaviromic identification of discriminative genomic features in SARS-CoV-2 using machine learning.

Authors:  Jonathan J Park; Sidi Chen
Journal:  Patterns (N Y)       Date:  2021-11-18

7.  A call for a more comprehensive SARS-CoV-2 sequence database for Brazil.

Authors:  Adriano Abbud; Euclides Ayres Castilho
Journal:  Lancet Reg Health Am       Date:  2021-11-18

Review 8.  Machine learning and its applications to biology.

Authors:  Adi L Tarca; Vincent J Carey; Xue-wen Chen; Roberto Romero; Sorin Drăghici
Journal:  PLoS Comput Biol       Date:  2007-06       Impact factor: 4.475

9.  An interactive web-based dashboard to track COVID-19 in real time.

Authors:  Ensheng Dong; Hongru Du; Lauren Gardner
Journal:  Lancet Infect Dis       Date:  2020-02-19       Impact factor: 25.071

