Chris Lauber, Stefan Seitz.
Abstract
Virus discovery has been fueled by new technologies ever since the first viruses were discovered at the end of the 19th century. Starting with mechanical devices that provided evidence for virus presence in sick hosts, virus discovery gradually transitioned into a sequence-based scientific discipline, which, nowadays, can characterize virus identity and explore viral diversity at an unprecedented resolution and depth. Sequencing technologies are now being used routinely and at ever-increasing scales, producing an avalanche of novel viral sequences found in a multitude of organisms and environments. In this perspective article, we argue that virus discovery has started to undergo another transformation prompted by the emergence of new approaches that are sequence data-centered and primarily computational, setting them apart from previous technology-driven innovations. The data-driven virus discovery approach is largely uncoupled from the collection and processing of biological samples, and exploits the availability of massive amounts of publicly and freely accessible data from sequencing archives. We discuss open challenges to be solved in order to unlock the full potential of data-driven virus discovery, and we highlight the benefits it can bring to classical (mostly molecular) virology and molecular biology in general.
Keywords: computational virology; data mining; sequencing archives; virosphere in health and disease; virus discovery
Year: 2022 PMID: 36008967 PMCID: PMC9406072 DOI: 10.3390/biom12081073
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Size increase in the Sequence Read Archive (SRA). Shown is the cumulative amount of total (yellow) and open access (blue) data, in petabytes, deposited in the SRA for each month between April 2008 and May 2022. The points represent the actual amounts, and the solid lines show nonlinear least squares fits of logistic functions, which captured the trend of the nonlinear increase considerably better than exponential functions (not shown). Parameters of the fitted curves are detailed in the insets. The dashed vertical lines indicate the time points at which the amount of open access data doubled relative to the previous doubling time point.
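The doubling time points described in the caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `logistic` and `doubling_points`, the three-parameter logistic form, and all parameter values are assumptions chosen for illustration, and the series is synthetic rather than actual SRA data. The sketch only marks successive doublings on a monthly cumulative series; it does not perform the nonlinear least squares fit itself.

```python
import math


def logistic(t, cap, rate, midpoint):
    """Three-parameter logistic curve (assumed form): cap / (1 + exp(-rate * (t - midpoint)))."""
    return cap / (1.0 + math.exp(-rate * (t - midpoint)))


def doubling_points(values):
    """Return the indices at which a cumulative series has at least doubled
    relative to its value at the previous doubling point.

    The first positive value serves as the initial reference.
    """
    points = []
    reference = None
    for i, v in enumerate(values):
        if reference is None:
            # Wait for the first positive value before tracking doublings.
            if v > 0:
                reference = v
            continue
        if v >= 2 * reference:
            points.append(i)
            reference = v  # the next doubling is measured from this point
    return points


# Synthetic monthly series: 170 months spans April 2008 to May 2022 inclusive.
# The parameters below are illustrative only, not the fitted values from the figure.
months = list(range(170))
series = [logistic(m, cap=50.0, rate=0.05, midpoint=120.0) for m in months]
marks = doubling_points(series)
print(marks)
```

Each index in `marks` is a month offset from the start of the series and corresponds to one dashed vertical line in a plot of this kind. Because a logistic curve saturates at `cap`, the gaps between successive doubling points widen as the series approaches its plateau, which is the qualitative difference from an exponential curve with evenly spaced doublings.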